---
title: "Choosing the Variational Family"
subtitle: "PyApprox Tutorial Library"
description: "How the choice of variational distribution controls the expressiveness-tractability tradeoff."
tutorial_type: hands-on
topic: uncertainty_quantification
difficulty: intermediate
estimated_time: 15
render_time: 55
prerequisites:
- vi_optimization
tags:
- uq
- beam
- variational-inference
- variational-family
- gaussian
- beta
format:
html:
code-fold: false
code-tools: true
toc: true
execute:
echo: true
warning: false
jupyter: python3
---
::: {.callout-tip collapse="true"}
## Download Notebook
[Download as Jupyter Notebook](notebooks/vi_families.ipynb)
:::
## Learning Objectives
After completing this tutorial, you will be able to:
- Explain the expressiveness--tractability tradeoff in choosing a variational family
- Compare diagonal (mean-field) and full-covariance Gaussian variational distributions
- Describe the Cholesky parameterization for full-covariance Gaussians and why it ensures positive-definiteness
- Explain when a low-rank covariance parameterization is preferable to a full-covariance one
- Use a Beta variational family for bounded parameters
- Select an appropriate variational family for a given problem
## Prerequisites
Complete [Optimizing the ELBO](vi_optimization.qmd) before this tutorial.
## The Tradeoff
The [previous tutorial](vi_optimization.qmd) showed that a diagonal Gaussian captures the marginal posterior means and variances but misses correlations, and that a full-covariance Gaussian can recover them. A natural question: why not always use the most expressive family possible?
The answer is cost. A diagonal Gaussian over $d$ latent variables has $2d$ parameters ($d$ means and $d$ log-standard-deviations). A full-covariance Gaussian has $d + d(d+1)/2$ parameters (means plus a lower-triangular Cholesky factor). For $d = 100$, that is 200 vs. 5,150 parameters. The richer family captures more structure but requires more optimization effort and more base samples.
The right choice depends on what matters in the posterior: if the parameters are nearly independent, mean-field is sufficient; if the data induces strong correlations, capturing them may be essential for accurate uncertainty quantification.
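These parameter counts are simple arithmetic and easy to verify directly (a standalone sketch, independent of any PyApprox API):

```{python}
# Parameter counts for the two Gaussian families, as plain arithmetic
def n_params_diagonal(d):
    # d means + d log-standard-deviations
    return 2 * d

def n_params_full(d):
    # d means + d(d+1)/2 entries of the lower-triangular Cholesky factor
    return d + d * (d + 1) // 2

for d in (2, 10, 100):
    print(f"d={d}: diagonal={n_params_diagonal(d)}, full={n_params_full(d)}")
```

At $d = 100$ this reproduces the 200 vs. 5,150 comparison above.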
## Setup
We use the 2D composite beam from the [previous tutorial](vi_optimization.qmd): a cantilever beam with uncertain skin modulus $E_1$ and core modulus $E_2$. With two correlated parameters, the limitations of each family are clearly visible.
```{python}
#| echo: false
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import math
```
```{python}
from pyapprox.util.backends.numpy import NumpyBkd
from pyapprox.benchmarks.functions.algebraic.cantilever_beam import (
CantileverBeam1DAnalytical,
)
from pyapprox.probability.likelihood import (
DiagonalGaussianLogLikelihood,
ModelBasedLogLikelihood,
)
from pyapprox.probability.likelihood.gaussian import (
MultiExperimentLogLikelihood,
)
from pyapprox.interface.functions.fromcallable.function import (
FunctionFromCallable,
)
from pyapprox.inverse.posterior.log_unnormalized import (
LogUnNormalizedPosterior,
)
from pyapprox.probability.joint.independent import IndependentJoint
from pyapprox.probability.univariate.gaussian import GaussianMarginal
bkd = NumpyBkd()
# Composite beam: (E1_skin, E2_core) -> tip_deflection
L, H, q0 = 100.0, 30.0, 10.0
composite_beam = CantileverBeam1DAnalytical(
length=L, height=H, skin_thickness=5.0, q0=q0, bkd=bkd,
)
tip_model_2d = FunctionFromCallable(
nqoi=1, nvars=2,
fun=lambda E: composite_beam(E)[0:1, :],
bkd=bkd,
)
# Independent Gaussian priors on E1 and E2
mu_prior_2d = np.array([12_000.0, 10_000.0])
sigma_prior_2d = np.array([2_000.0, 2_000.0])
prior_2d = IndependentJoint(
[GaussianMarginal(mean=mu_prior_2d[i], stdev=sigma_prior_2d[i], bkd=bkd)
for i in range(2)],
bkd,
)
# Standardized model: z -> tip_deflection
std_tip_model_2d = FunctionFromCallable(
nqoi=1, nvars=2,
fun=lambda z: tip_model_2d(
bkd.asarray(mu_prior_2d[:, None])
+ bkd.asarray(sigma_prior_2d[:, None]) * z
),
bkd=bkd,
)
# True parameters and observation
E1_true, E2_true = 10_000.0, 8_000.0
E_true_2d = bkd.asarray([[E1_true], [E2_true]])
sigma_noise_2d = 0.4
noise_lik_2d = DiagonalGaussianLogLikelihood(
bkd.asarray([sigma_noise_2d**2]), bkd,
)
model_lik_2d = ModelBasedLogLikelihood(tip_model_2d, noise_lik_2d, bkd)
np.random.seed(7)
y_obs_2d = float(model_lik_2d.rvs(E_true_2d)[0, 0])
noise_lik_2d.set_observations(bkd.asarray([[y_obs_2d]]))
# Exact posterior on grid for comparison
posterior_fn = LogUnNormalizedPosterior(
model_fn=tip_model_2d, likelihood=noise_lik_2d, prior=prior_2d, bkd=bkd,
)
n_g = 80
E1_g = np.linspace(4_000, 20_000, n_g)
E2_g = np.linspace(2_000, 18_000, n_g)
E1_m, E2_m = np.meshgrid(E1_g, E2_g)
grid_pts = bkd.asarray(np.vstack([E1_m.ravel(), E2_m.ravel()]))
lp_flat = bkd.to_numpy(posterior_fn(grid_pts))
lp_grid = lp_flat.reshape(n_g, n_g)
p_grid = np.exp(lp_grid - np.max(lp_grid))
print(f"True parameters: E1 = {E1_true:.0f}, E2 = {E2_true:.0f}")
print(f"Observed tip deflection: {y_obs_2d:.6f}")
```
## Family 1: Diagonal (Mean-Field) Gaussian
The mean-field family assumes that all latent variables are mutually independent:
$$
q(E_1, E_2) = q_1(E_1) \cdot q_2(E_2) = \mathcal{N}(E_1;\, \mu_1, \sigma_1^2) \cdot \mathcal{N}(E_2;\, \mu_2, \sigma_2^2)
$$
This is what `ConditionalIndependentJoint` with per-dimension `ConditionalGaussian` components provides. It has **4 parameters**: $(\mu_1, \log\sigma_1, \mu_2, \log\sigma_2)$.
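Before fitting, the factorization is easy to verify numerically: the joint log-density of a diagonal Gaussian is the sum of its per-dimension marginal log-densities. The values below are arbitrary illustrative numbers, not fitted variational parameters:

```{python}
# Mean-field sanity check: joint log-density = sum of marginal log-densities
import numpy as np
from scipy.stats import norm

mu = np.array([11_000.0, 9_000.0])      # illustrative means
sigma = np.array([1_500.0, 1_800.0])    # illustrative standard deviations
x = np.array([10_500.0, 9_400.0])       # an evaluation point

log_q_joint = norm.logpdf(x, loc=mu, scale=sigma).sum()
log_q_marginals = sum(norm.logpdf(x[i], mu[i], sigma[i]) for i in range(2))
print(np.isclose(log_q_joint, log_q_marginals))  # True
```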
```{python}
from pyapprox.probability.conditional.gaussian import ConditionalGaussian
from pyapprox.probability.conditional.joint import ConditionalIndependentJoint
from pyapprox.probability.univariate import UniformMarginal
from pyapprox.surrogates.affine.basis import OrthonormalPolynomialBasis
from pyapprox.surrogates.affine.expansions import BasisExpansion
from pyapprox.surrogates.affine.indices import compute_hyperbolic_indices
from pyapprox.surrogates.affine.univariate import create_bases_1d
from pyapprox.inverse.variational.elbo import make_single_problem_elbo
from pyapprox.inverse.variational.fitter import VariationalFitter
from pyapprox.optimization.minimize.scipy.trust_constr import (
ScipyTrustConstrOptimizer,
)
def _make_degree0_expansion(bkd, coeff=0.0):
marginals = [UniformMarginal(-1.0, 1.0, bkd)]
bases_1d = create_bases_1d(marginals, bkd)
indices = compute_hyperbolic_indices(1, 0, 1.0, bkd)
basis = OrthonormalPolynomialBasis(bases_1d, bkd, indices)
exp = BasisExpansion(basis, bkd, nqoi=1)
exp.set_coefficients(bkd.asarray([[coeff]]))
return exp
def build_diagonal_gaussian(bkd, ndim):
"""Build a diagonal (mean-field) Gaussian variational distribution."""
conds = []
for _ in range(ndim):
mf = _make_degree0_expansion(bkd, 0.0)
lsf = _make_degree0_expansion(bkd, 0.0)
conds.append(ConditionalGaussian(mf, lsf, bkd))
var_dist = ConditionalIndependentJoint(conds, bkd)
return var_dist, conds
# Build diagonal Gaussian VI (in standardized space)
var_dist_diag, conds_diag = build_diagonal_gaussian(bkd, 2)
vi_prior = IndependentJoint(
[GaussianMarginal(0.0, 1.0, bkd) for _ in range(2)], bkd,
)
base_lik = DiagonalGaussianLogLikelihood(
bkd.asarray([sigma_noise_2d**2]), bkd,
)
multi_lik = MultiExperimentLogLikelihood(
base_lik, bkd.asarray([[y_obs_2d]]), bkd,
)
def log_lik_2d(z):
return multi_lik.logpdf(std_tip_model_2d(z))
nsamples = 300
np.random.seed(42)
base_samples = bkd.asarray(np.random.normal(0, 1, (2, nsamples)))
weights = bkd.full((1, nsamples), 1.0 / nsamples)
elbo_diag = make_single_problem_elbo(
var_dist_diag, log_lik_2d, vi_prior, base_samples, weights, bkd,
)
opt = ScipyTrustConstrOptimizer(maxiter=100, gtol=1e-8, verbosity=0)
fitter = VariationalFitter(bkd, optimizer=opt)
result_diag = fitter.fit(elbo_diag)
print(f"Diagonal Gaussian: {elbo_diag.nvars()} parameters")
print(f"Negative ELBO: {result_diag.neg_elbo():.4f}")
```
## Family 2: Full-Covariance Gaussian
To capture correlations, we replace the diagonal covariance with a full covariance matrix. The covariance must be symmetric positive-definite, which we enforce via the **Cholesky parameterization**: store the lower-triangular factor $L$ such that $\Sigma = L L^\top$. Specifically, we parameterize the log of the diagonal entries to ensure positivity:
$$
L = \begin{pmatrix} e^{\ell_{11}} & 0 \\ \ell_{21} & e^{\ell_{22}} \end{pmatrix}, \qquad \Sigma = L L^\top
$$
This is exactly what `ConditionalDenseCholGaussian` implements. For $d = 2$, the parameterization has $2 + 2 + 1 = 5$ parameters (2 means, 2 log-diagonal entries, 1 off-diagonal entry).
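The guarantee of positive-definiteness is easy to check with a few lines of NumPy. This is a standalone sketch of the parameterization with arbitrary illustrative values, independent of the `ConditionalDenseCholGaussian` internals:

```{python}
# Cholesky parameterization sketch: any unconstrained (l11, l21, l22)
# yields a lower-triangular factor with positive diagonal, and hence a
# symmetric positive-definite covariance Sigma = L L^T.
import numpy as np

ell11, ell21, ell22 = -0.3, 0.8, 0.1   # any real values are valid
L_chol = np.array([[np.exp(ell11), 0.0],
                   [ell21, np.exp(ell22)]])
Sigma = L_chol @ L_chol.T

print(np.allclose(Sigma, Sigma.T))            # symmetric
print(np.all(np.linalg.eigvalsh(Sigma) > 0))  # positive-definite
```

Because the diagonal of $L$ is strictly positive, $L$ is invertible and $\Sigma = LL^\top$ is positive-definite for every choice of unconstrained parameters, so the optimizer never has to enforce a constraint.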
```{python}
from pyapprox.probability.conditional.multivariate_gaussian import (
ConditionalDenseCholGaussian,
)
from pyapprox.probability.gaussian.dense import (
DenseCholeskyMultivariateGaussian,
)
from pyapprox.surrogates.affine.expansions.pce import (
create_pce_from_marginals,
)
def _make_expansion_nd(bkd, nvars_in, degree, nqoi, coeff=0.0):
"""Create a BasisExpansion with given degree, input dim, and nqoi."""
marginals = [UniformMarginal(-1.0, 1.0, bkd) for _ in range(nvars_in)]
exp = create_pce_from_marginals(marginals, degree, bkd, nqoi=nqoi)
nterms = exp.nterms()
coeffs = np.zeros((nterms, nqoi))
coeffs[0, :] = coeff
exp.set_coefficients(bkd.asarray(coeffs))
return exp
def build_dense_chol_gaussian(bkd, d, nvars_in=1, degree=0):
"""Build a full-covariance Gaussian via Cholesky parameterization."""
mean_func = _make_expansion_nd(bkd, nvars_in, degree, nqoi=d)
log_chol_diag_func = _make_expansion_nd(bkd, nvars_in, degree, nqoi=d)
n_offdiag = d * (d - 1) // 2
chol_offdiag_func = None
if d > 1:
chol_offdiag_func = _make_expansion_nd(
bkd, nvars_in, degree, nqoi=n_offdiag,
)
return ConditionalDenseCholGaussian(
mean_func, log_chol_diag_func, chol_offdiag_func, bkd,
)
var_dist_full = build_dense_chol_gaussian(bkd, d=2)
prior_full = DenseCholeskyMultivariateGaussian(
bkd.zeros((2, 1)), bkd.eye(2), bkd,
)
elbo_full = make_single_problem_elbo(
var_dist_full, log_lik_2d, prior_full, base_samples, weights, bkd,
)
fitter_full = VariationalFitter(bkd, optimizer=opt)
result_full = fitter_full.fit(elbo_full)
print(f"Full-covariance Gaussian: {elbo_full.nvars()} parameters")
print(f"Negative ELBO: {result_full.neg_elbo():.4f}")
```
@fig-family-comparison compares the two Gaussian families on the 2D composite beam posterior.
```{python}
#| echo: false
#| fig-cap: "Diagonal vs. full-covariance Gaussian VI on the 2D composite beam posterior. Left: the diagonal (mean-field) Gaussian has axis-aligned ellipses --- it cannot represent the posterior's tilt. Right: the full-covariance Gaussian tilts its ellipses to match the posterior correlation, producing a substantially better fit and a lower negative ELBO."
#| label: fig-family-comparison
from pyapprox.tutorial_figures._vi import plot_family_comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
plot_family_comparison(
E1_m, E2_m, p_grid, E1_true, E2_true,
conds_diag, var_dist_full,
mu_prior_2d, sigma_prior_2d,
result_diag, result_full, bkd, ax1, ax2,
)
plt.tight_layout()
plt.show()
```
## Family 3: Low-Rank Covariance
The full Cholesky parameterization stores $d(d+1)/2$ entries. For large $d$, this becomes expensive. The **low-rank** parameterization approximates the covariance as:
$$
\Sigma = D^2 + V V^\top
$$
where $D = \mathrm{diag}(e^{\ell_1}, \ldots, e^{\ell_d})$ is a diagonal matrix and $V$ is a $d \times r$ factor with $r \ll d$. This captures the $r$ most important correlation directions using only $d + dr$ parameters instead of $d(d+1)/2$.
In PyApprox, `ConditionalLowRankCholGaussian` implements this parameterization. For the 2D problem ($d = 2$, $r = 2$), the low-rank family matches the full-covariance family's expressiveness while actually using more covariance parameters ($2 + 2 \cdot 2 = 6$ vs. 3); the benefit appears only for $d \gg 2$.
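Both the positive-definiteness of $\Sigma = D^2 + VV^\top$ and the parameter savings are easy to check directly in NumPy (a standalone sketch with arbitrary illustrative values, independent of the PyApprox implementation):

```{python}
# Low-rank covariance sketch: Sigma = D^2 + V V^T is positive-definite
# whenever the diagonal of D is nonzero, and uses d + d*r covariance
# parameters instead of d(d+1)/2.
import numpy as np

d, r = 50, 2
rng = np.random.default_rng(0)
log_diag = rng.normal(size=d)          # unconstrained log-diagonal of D
V = rng.normal(size=(d, r))            # d x r low-rank factor
Sigma = np.diag(np.exp(log_diag) ** 2) + V @ V.T

print(np.all(np.linalg.eigvalsh(Sigma) > 0))  # True: positive-definite
print(f"low-rank: {d + d * r} params, full: {d * (d + 1) // 2} params")
```

At $d = 50$ with $r = 2$, the low-rank form needs 150 covariance parameters versus 1,275 for the full Cholesky factor.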
```{python}
from pyapprox.probability.conditional.multivariate_gaussian import (
ConditionalLowRankCholGaussian,
)
def build_low_rank_gaussian(bkd, d, rank, nvars_in=1, degree=0):
"""Build a low-rank covariance Gaussian."""
mean_func = _make_expansion_nd(bkd, nvars_in, degree, nqoi=d)
log_diag_func = _make_expansion_nd(bkd, nvars_in, degree, nqoi=d)
# Set lower bounds on log_diag to prevent numerical singularity
for hp in log_diag_func.hyp_list().hyperparameters():
bounds = hp.get_bounds()
lower = bkd.full((bounds.shape[0],), -6.0)
new_bounds = bkd.stack([lower, bounds[:, 1]], axis=1)
hp.set_bounds(new_bounds)
factor_func = None
if rank > 0:
factor_func = _make_expansion_nd(
bkd, nvars_in, degree, nqoi=d * rank,
)
return ConditionalLowRankCholGaussian(
mean_func, log_diag_func, factor_func, rank, bkd,
)
var_dist_lr = build_low_rank_gaussian(bkd, d=2, rank=2)
elbo_lr = make_single_problem_elbo(
var_dist_lr, log_lik_2d, prior_full, base_samples, weights, bkd,
)
fitter_lr = VariationalFitter(bkd, optimizer=opt)
result_lr = fitter_lr.fit(elbo_lr)
print(f"Low-rank Gaussian (rank=2): {elbo_lr.nvars()} parameters")
print(f"Negative ELBO: {result_lr.neg_elbo():.4f}")
```
::: {.callout-tip}
## When to use low-rank
The low-rank parameterization is most useful when $d$ is large (tens to hundreds) and the posterior correlation structure is approximately low-dimensional --- for example, when observations constrain a few linear combinations of the parameters much more tightly than others. For $d = 2$ or $d = 3$, the full Cholesky parameterization is usually preferable.
:::
## Family 4: Beta for Bounded Parameters
Not all parameters are unbounded. For a probability $p \in (0, 1)$, a Gaussian variational distribution is a poor choice because it places mass outside the valid range. A **Beta** variational family naturally respects the bounds.
To illustrate, consider the classic Beta-Bernoulli problem: we observe coin flips and want to infer the probability of heads $p$. The prior is $p \sim \text{Beta}(2, 2)$ (mildly favoring fairness), and we observe $[1, 1, 0, 1, 0]$ (3 heads, 2 tails). The exact conjugate posterior is $\text{Beta}(5, 4)$.
```{python}
from pyapprox.probability.conditional.beta import ConditionalBeta
from pyapprox.probability.univariate.beta import BetaMarginal
from pyapprox.inverse.conjugate.beta import BetaConjugatePosterior
# Exact conjugate posterior
prior_alpha, prior_beta = 2.0, 2.0
obs_list = [1.0, 1.0, 0.0, 1.0, 0.0]
conjugate = BetaConjugatePosterior(prior_alpha, prior_beta, bkd)
conjugate.compute(bkd.asarray([obs_list]))
exact_posterior_alpha = prior_alpha + sum(obs_list)
exact_posterior_beta = prior_beta + len(obs_list) - sum(obs_list)
print(f"Exact posterior: Beta({exact_posterior_alpha}, {exact_posterior_beta})")
print(f"Exact posterior mean: {exact_posterior_alpha / (exact_posterior_alpha + exact_posterior_beta):.4f}")
```
```{python}
# VI with Beta variational family
log_alpha_func = _make_degree0_expansion(bkd, math.log(prior_alpha))
log_beta_func = _make_degree0_expansion(bkd, math.log(prior_beta))
cond_beta = ConditionalBeta(log_alpha_func, log_beta_func, bkd)
prior_marginal = BetaMarginal(prior_alpha, prior_beta, bkd)
obs = bkd.asarray([obs_list])
def bernoulli_log_lik(z):
"""Bernoulli log-likelihood: sum_i obs_i*log(p) + (1-obs_i)*log(1-p)."""
p = bkd.clip(z, 1e-8, 1.0 - 1e-8)
n_success = bkd.sum(obs)
n_fail = obs.shape[1] - n_success
return n_success * bkd.log(p) + n_fail * bkd.log(1.0 - p)
np.random.seed(42)
base_samples_beta = bkd.asarray(np.random.uniform(0.01, 0.99, (1, 500)))
weights_beta = bkd.full((1, 500), 1.0 / 500)
elbo_beta = make_single_problem_elbo(
cond_beta, bernoulli_log_lik, prior_marginal,
base_samples_beta, weights_beta, bkd,
)
fitter_beta = VariationalFitter(bkd)
result_beta = fitter_beta.fit(elbo_beta)
# Extract recovered parameters
dummy_x = bkd.zeros((cond_beta.nvars(), 1))
rec_alpha = math.exp(float(cond_beta._log_alpha_func(dummy_x)[0, 0]))
rec_beta = math.exp(float(cond_beta._log_beta_func(dummy_x)[0, 0]))
rec_mean = rec_alpha / (rec_alpha + rec_beta)
print(f"VI posterior: Beta({rec_alpha:.2f}, {rec_beta:.2f})")
print(f"VI posterior mean: {rec_mean:.4f}")
```
@fig-beta-vi shows the VI Beta approximation compared to the exact conjugate posterior.
```{python}
#| echo: false
#| fig-cap: "Beta VI on the coin-flip problem. The VI approximation (blue dashed) closely matches the exact conjugate posterior (orange). The Beta variational family naturally respects the parameter bounds $p \\in (0, 1)$."
#| label: fig-beta-vi
from pyapprox.tutorial_figures._vi import plot_beta_vi
fig, ax = plt.subplots(figsize=(9, 4.5))
plot_beta_vi(prior_alpha, prior_beta, exact_posterior_alpha,
exact_posterior_beta, rec_alpha, rec_beta, ax)
plt.tight_layout()
plt.show()
```
## Choosing a Family: Summary
The following guide summarizes when to use each family:
| Family | Parameters | Captures | Best for |
|---|---|---|---|
| Diagonal Gaussian | $2d$ | Marginals only | Independent or weakly correlated parameters |
| Full-covariance Gaussian | $d + d(d{+}1)/2$ | Full correlation | Moderate $d$ with strong correlations |
| Low-rank Gaussian | $d + dr$ | Top-$r$ correlations | Large $d$ with low-rank correlation structure |
| Beta | $2$ per dim | Bounded, possibly skewed | Probability parameters or bounded quantities |
When in doubt, start with the diagonal Gaussian as a quick baseline, then try full covariance and keep it if the negative ELBO drops substantially.
::: {.callout-important}
## Validating the approximation
A lower negative ELBO means a better approximation (closer to the true posterior). Use the ELBO to compare families: if switching from diagonal to full-covariance reduces the negative ELBO significantly, the correlation structure matters and the richer family is worthwhile.
:::
## Key Takeaways
- The **diagonal (mean-field) Gaussian** is fast but ignores correlations; use it as a baseline or when parameters are approximately independent
- The **full-covariance Gaussian** via Cholesky parameterization captures all pairwise correlations; the Cholesky factor ensures positive-definiteness by construction
- The **low-rank covariance** parameterization ($\Sigma = D^2 + VV^\top$) scales to high dimensions by capturing only the top-$r$ correlation directions
- A **Beta** variational family naturally handles bounded parameters and can recover conjugate posteriors for Bernoulli likelihoods
- Compare families using the **negative ELBO**: a significant decrease indicates the richer family captures important posterior structure
## Exercises
1. On the 2D composite beam, compare the negative ELBO across all three Gaussian families (diagonal, full-covariance, low-rank with $r = 1$). How much does full-covariance improve over diagonal? How does low-rank with $r = 1$ compare?
2. Change the noise standard deviation in the 2D beam to $\sigma = 2.0$ (5x larger). Does the posterior become less correlated? Does the gap between diagonal and full-covariance ELBO shrink?
3. For the Beta-Bernoulli problem, change the observations to $[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]$ (8 heads, 2 tails). Verify that VI still recovers the correct conjugate posterior $\text{Beta}(10, 4)$.
4. **(Challenge)** Extend the composite beam to 3 uncertain moduli by adding a third layer. Compare the number of parameters and optimization time for diagonal vs. full-covariance vs. low-rank ($r = 1$) Gaussians. At what dimension does low-rank become advantageous?
## Next Steps
Continue with:
- [Amortized Variational Inference](vi_amortized.qmd) --- Learning to infer posteriors for multiple datasets simultaneously