Bayesian OED: The Double-Loop Estimator
PyApprox Tutorial Library
Learning Objectives
After completing this tutorial, you will be able to:
- State the mutual-information form of EIG and identify the term that makes it expensive to evaluate
- Explain why the inner loop approximates the marginal evidence and what determines its accuracy
- Describe the joint sampling requirement for the outer loop
- Trace data flow through the double-loop algorithm and identify which arrays must be reused across loops
Prerequisites
Complete Learning from Experiments: Bayesian OED before this tutorial.
Problem Setup
We observe \(K\) candidate measurements. Each sensor \(k\) has a design weight \(w_k\) that scales its effective precision. The observation model is:
\[ y_k \mid \boldsymbol{\theta}, w_k \;\sim\; \mathcal{N}\!\left(h_k(\boldsymbol{\theta}),\; \gamma_k / w_k\right), \]
where \(h_k(\boldsymbol{\theta}) = [\mathbf{f}(\boldsymbol{\theta})]_k\) is the forward model at sensor \(k\) and \(\gamma_k\) is the base noise variance. The \(K\)-sensor joint log-likelihood is:
\[ \log p(\mathbf{y} \mid \boldsymbol{\theta}, \mathbf{w}) = -\tfrac{1}{2}\,\mathbf{r}^\top \mathbf{W}^{1/2} \boldsymbol{\Gamma}^{-1} \mathbf{W}^{1/2} \mathbf{r} + C_{\mathbf{w}}, \]
where \(\mathbf{r} = \mathbf{f}(\boldsymbol{\theta}) - \mathbf{y}\), \(\mathbf{W} = \mathrm{Diag}[\mathbf{w}]\), \(\boldsymbol{\Gamma} = \mathrm{Diag}[\boldsymbol{\gamma}]\), and
\[ C_{\mathbf{w}} = \tfrac{1}{2}\log\lvert\mathbf{W}^{1/2}\boldsymbol{\Gamma}^{-1}\mathbf{W}^{1/2}\rvert = \tfrac{1}{2}\sum_{k=1}^{K}\log(w_k/\gamma_k). \]
The weights satisfy \(\mathbf{w} \in [0,1]^K\) with \(\sum_k w_k = 1\).
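The joint log-likelihood above can be sketched directly in NumPy. This is an illustrative helper, not a PyApprox function; the name `weighted_gaussian_loglike` is ours. Following the document's convention, the constant \(-\tfrac{K}{2}\log 2\pi\) is omitted, which is harmless because it cancels in the EIG difference.

```python
import numpy as np

def weighted_gaussian_loglike(f_theta, y, w, gamma):
    """Log p(y | theta, w) for y_k ~ N(h_k(theta), gamma_k / w_k).

    f_theta, y, w, gamma : 1D arrays of length K.
    Omits the -K/2 log(2*pi) constant, matching C_w in the text.
    """
    r = f_theta - y                      # residual r = f(theta) - y
    quad = np.sum(w * r**2 / gamma)      # r^T W^{1/2} Gamma^{-1} W^{1/2} r
    log_det = np.sum(np.log(w / gamma))  # 2 * C_w
    return -0.5 * quad + 0.5 * log_det
```

Because \(\mathbf{W}\) and \(\boldsymbol{\Gamma}\) are diagonal, the quadratic form reduces to an elementwise weighted sum of squared residuals, with no matrix factorization needed for a single evaluation.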
Reparameterized Outer Samples
Because \(w_k\) sets the noise variance to \(\gamma_k/w_k\), the correct way to generate an outer observation is:
\[ \mathbf{y}^{(m)}(\mathbf{w}) = \mathbf{f}(\boldsymbol{\theta}^{(m)}) + \mathbf{W}^{-1/2}\boldsymbol{\Gamma}^{1/2}\boldsymbol{\varepsilon}^{(m)}, \quad \boldsymbol{\varepsilon}^{(m)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_K). \]
Drawing from \(\mathcal{N}(\mathbf{0}, \mathbf{I}_K)\) and scaling by \(\mathbf{W}^{-1/2}\boldsymbol{\Gamma}^{1/2}\) is the reparameterization of the observation. It keeps the randomness in a fixed standard-normal draw while making \(\mathbf{y}^{(m)}\) a differentiable function of \(\mathbf{w}\). Why this matters for gradients is covered in Bayesian OED: Gradients and the Reparameterization Trick; for estimation alone, the key point is to use \(\mathbf{W}^{-1/2}\boldsymbol{\Gamma}^{1/2}\) scaling, not raw \(\boldsymbol{\Gamma}^{1/2}\), when drawing outer noise.
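A minimal sketch of the reparameterized draw, vectorized over \(M\) outer samples (the function name is ours; `F` holds the forward-model evaluations and `eps` the fixed standard-normal draws):

```python
import numpy as np

def sample_outer_observations(F, w, gamma, eps):
    """Form Y = F + W^{-1/2} Gamma^{1/2} E, column by column.

    F   : (K, M) forward-model evaluations f(theta^(m))
    eps : (K, M) standard-normal draws, fixed w.r.t. w
    """
    scale = np.sqrt(gamma / w)           # diagonal of W^{-1/2} Gamma^{1/2}
    return F + scale[:, None] * eps
```

Because `eps` is held fixed, changing `w` rescales the noise deterministically, so each column of `Y` has per-sensor variance \(\gamma_k / w_k\) as required.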
The Double-Loop Estimator
Why an Inner Loop?
The mutual information form of EIG is:
\[ \text{EIG}(\mathbf{w}) = \mathbb{E}_{p(\boldsymbol{\theta},\mathbf{y}\mid\mathbf{w})} \!\left[ \log p(\mathbf{y}\mid\boldsymbol{\theta},\mathbf{w}) - \log p(\mathbf{y}\mid\mathbf{w}) \right]. \]
The first term — the log-likelihood — is cheap to evaluate for any \((\boldsymbol{\theta}, \mathbf{y})\) pair. The second term — the log marginal evidence — requires integrating out all parameters:
\[ p(\mathbf{y}\mid\mathbf{w}) = \int p(\mathbf{y}\mid\boldsymbol{\theta},\mathbf{w})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}. \]
This integral is analytically intractable for nonlinear forward models. The inner loop approximates it by averaging over \(N_\text{in}\) freshly drawn prior samples:
\[ p(\mathbf{y}\mid\mathbf{w}) \approx \frac{1}{N_\text{in}}\sum_{n=1}^{N_\text{in}} p(\mathbf{y}\mid\boldsymbol{\theta}^{(n)},\mathbf{w}). \]
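For a single observation \(\mathbf{y}\), the inner average looks like the sketch below (an illustration assuming the weighted-Gaussian likelihood defined earlier; the helper name is ours). It is written in probability space for clarity, though a stable implementation works in log space, as discussed later in this tutorial.

```python
import numpy as np

def inner_evidence(y, F_star, w, gamma):
    """Monte Carlo estimate of p(y | w) from N_in fresh prior samples.

    y      : (K,) observation
    F_star : (K, N_in) forward evaluations at the inner prior draws
    """
    R = y[:, None] - F_star                                  # (K, N_in)
    ell = (0.5 * np.sum(np.log(w / gamma))
           - 0.5 * np.sum(w[:, None] * R**2 / gamma[:, None], axis=0))
    return np.mean(np.exp(ell))          # average of N_in likelihood values
```

Its accuracy is governed by how well the \(N_\text{in}\) prior draws cover the region where the likelihood of \(\mathbf{y}\) is non-negligible.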
The Estimator
Substituting the inner-loop approximation and averaging over \(M\) outer samples:
\[ \widehat{\text{EIG}}(\mathbf{w}) = \frac{1}{M} \sum_{m=1}^{M} \left\{ \log p\!\left(\mathbf{y}^{(m)} \mid \boldsymbol{\theta}^{(m)}, \mathbf{w}\right) - \log\!\left[ \frac{1}{N_\text{in}} \sum_{n=1}^{N_\text{in}} p\!\left(\mathbf{y}^{(m)} \mid \boldsymbol{\theta}^{(n)}, \mathbf{w}\right) \right] \right\}. \]
The outer samples \((\boldsymbol{\theta}^{(m)}, \boldsymbol{\varepsilon}^{(m)})\) are drawn jointly from the model. The inner samples \(\boldsymbol{\theta}^{(n)}\) are drawn independently from the same prior. The observation \(\mathbf{y}^{(m)}\) is reused across all \(N_\text{in}\) inner likelihood evaluations — this is what makes the estimator a nested loop rather than two independent sweeps.
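The full estimator can be sketched end to end in NumPy. This is a self-contained illustration, not the PyApprox implementation: `fwd` and `prior_sampler` are hypothetical user-supplied callables, and the \(2\pi\) constants are omitted throughout since they cancel in the difference.

```python
import numpy as np
from scipy.special import logsumexp

def double_loop_eig(fwd, prior_sampler, w, gamma, M=500, N_in=500, rng=None):
    """Double-loop Monte Carlo estimate of EIG(w).

    fwd           : maps (nvars, n) parameter samples to (K, n) outputs
    prior_sampler : (n, rng) -> (nvars, n) prior samples
    """
    rng = np.random.default_rng(rng)
    log_c = 0.5 * np.sum(np.log(w / gamma))            # C_w

    # Outer loop: joint samples (theta^(m), y^(m)), reparameterized noise
    theta_out = prior_sampler(M, rng)                  # (nvars, M)
    F = fwd(theta_out)                                 # (K, M)
    eps = rng.standard_normal(F.shape)
    Y = F + np.sqrt(gamma / w)[:, None] * eps

    # log p(y^(m) | theta^(m), w)
    ell_out = log_c - 0.5 * np.sum(
        w[:, None] * (Y - F) ** 2 / gamma[:, None], axis=0)

    # Inner loop: fresh prior samples, reusing every column of Y
    theta_in = prior_sampler(N_in, rng)                # (nvars, N_in)
    F_star = fwd(theta_in)                             # (K, N_in)
    R = Y[:, :, None] - F_star[:, None, :]             # (K, M, N_in)
    ell_in = log_c - 0.5 * np.einsum('k,kmn->mn', w / gamma, R ** 2)

    # Stable inner evidence, then the outer average
    log_evid = logsumexp(ell_in, axis=1) - np.log(N_in)
    return np.mean(ell_out - log_evid)
```

Note the broadcasting in `R`: each outer observation is paired with every inner parameter sample, which is exactly the reuse of \(\mathbf{y}^{(m)}\) that makes the loops nested.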
Computational Steps
Outer loop (\(M\) samples):
1. Draw \(\boldsymbol{\theta}^{(m)} \sim p(\boldsymbol{\theta})\) for \(m=1,\ldots,M\).
2. Evaluate \(\mathbf{F} = [\mathbf{f}(\boldsymbol{\theta}^{(1)}), \ldots, \mathbf{f}(\boldsymbol{\theta}^{(M)})]\).
3. Draw \(\boldsymbol{\mathcal{E}} \in \mathbb{R}^{K \times M}\) with columns \(\sim \mathcal{N}(\mathbf{0}, \mathbf{I}_K)\) and form \(\mathbf{Y} = \mathbf{F} + \mathbf{W}^{-1/2}\boldsymbol{\Gamma}^{1/2}\boldsymbol{\mathcal{E}}\).
4. Compute the per-sample log-likelihood via Cholesky:
   - \(\mathbf{R} = \mathbf{Y} - \mathbf{F}\)
   - Factorize \(\boldsymbol{\Gamma} = \mathbf{L}\mathbf{L}^\top\)
   - \(\mathbf{P} = \mathbf{L}^{-1} \mathbf{W}^{1/2} \mathbf{R}\)
   - \(\ell_m = C_{\mathbf{w}} - \tfrac{1}{2}[\mathbf{P}^\top\mathbf{P}]_{mm}\)
Inner loop (\(N_\text{in}\) samples, reusing \(\mathbf{Y}\)):
5. Draw \(\boldsymbol{\theta}^{(n)} \sim p(\boldsymbol{\theta})\) for \(n=1,\ldots,N_\text{in}\), independent of the outer draws.
6. Evaluate \(\mathbf{F}^\star = [\mathbf{f}(\boldsymbol{\theta}^{(1)}), \ldots, \mathbf{f}(\boldsymbol{\theta}^{(N_\text{in})})]\).
7. Compute residuals \(R^\star_{k,m,n} = Y_{k,m} - F^\star_{k,n}\) for all \((m,n)\).
8. Compute log-likelihoods, then accumulate the inner evidence via stable log-sum-exp:
\[ \log\hat{p}(\mathbf{y}^{(m)}\mid\mathbf{w}) = \log\!\sum_{n=1}^{N_\text{in}}\exp(\ell_n^{(m)}) - \log N_\text{in}. \]
Final estimate: \(\widehat{\text{EIG}} = M^{-1}\sum_m [\ell_m - \log\hat{p}(\mathbf{y}^{(m)}\mid\mathbf{w})]\).
Direct summation in probability space underflows to zero at realistic problem sizes: the individual likelihood values \(\exp(\ell_n^{(m)})\) fall below the smallest representable float long before the log-likelihoods themselves become extreme. Always accumulate the inner average with the log-sum-exp trick, then subtract \(\log N_\text{in}\).
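The failure mode is easy to demonstrate. With many sensors or small noise variances, inner log-likelihoods of magnitude several hundred are routine, and `exp` of such values underflows to exactly zero in double precision:

```python
import numpy as np
from scipy.special import logsumexp

# Typical inner log-likelihoods for many sensors / small noise variances.
ell = np.full(1000, -800.0)

with np.errstate(divide='ignore'):
    naive = np.log(np.mean(np.exp(ell)))   # exp(-800) underflows to 0.0
stable = logsumexp(ell) - np.log(len(ell))

print(naive)   # -inf
print(stable)  # -800.0
```

`scipy.special.logsumexp` shifts by the maximum before exponentiating, so the result stays finite whenever the inputs do.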
Key Takeaways
- The inner loop exists solely to approximate the intractable marginal evidence \(p(\mathbf{y}\mid\mathbf{w})\); the outer loop controls the Monte Carlo variance of the full EIG estimate.
- Outer observations must use the \(\mathbf{w}\)-dependent noise scaling: \(\mathbf{y}^{(m)} = \mathbf{f}(\boldsymbol{\theta}^{(m)}) + \mathbf{W}^{-1/2}\boldsymbol{\Gamma}^{1/2}\boldsymbol{\varepsilon}^{(m)}\).
- The observation \(\mathbf{y}^{(m)}\) is reused across all \(N_\text{in}\) inner evaluations; inner samples are drawn independently.
- Log-sum-exp is required when accumulating the inner average.
- Convergence depends on both \(M\) (variance) and \(N_\text{in}\) (bias); see Bayesian OED: Verifying Convergence for empirical rates.
Exercises
In step 8, why must log-sum-exp be computed over axis \(n\) (inner samples) rather than axis \(m\) (outer samples)?
Suppose the forward model is linear, \(\mathbf{f}(\boldsymbol{\theta}) = \mathbf{A}\boldsymbol{\theta}\), and the prior is Gaussian. Show that the inner-loop average converges to the exact marginal as \(N_\text{in}\to\infty\) and write down what \(p(\mathbf{y}\mid\mathbf{w})\) equals analytically.
The inner-loop average is an unbiased estimate of the evidence \(p(\mathbf{y}\mid\mathbf{w})\), but its logarithm is not. Use Jensen's inequality to determine the sign of the resulting bias in \(\widehat{\text{EIG}}\) for finite \(N_\text{in}\), and explain why the bias vanishes as \(N_\text{in}\to\infty\).
What goes wrong if inner samples \(\boldsymbol{\theta}^{(n)}\) are reused from the outer loop instead of drawn independently?
Next Steps
- Bayesian OED: Gradients and the Reparameterization Trick — why \(\partial\mathbf{y}/\partial\mathbf{w} \neq 0\) and how this changes the gradient
- Bayesian OED: Verifying Convergence — measuring MSE, bias, and variance with KLOEDDiagnostics