Choosing the Variational Family

PyApprox Tutorial Library

How the choice of variational distribution controls the expressiveness-tractability tradeoff.

Learning Objectives

After completing this tutorial, you will be able to:

  • Explain the expressiveness–tractability tradeoff in choosing a variational family
  • Compare diagonal (mean-field) and full-covariance Gaussian variational distributions
  • Describe the Cholesky parameterization for full-covariance Gaussians and why it ensures positive-definiteness
  • Explain when a low-rank covariance parameterization is preferable to a full-covariance one
  • Use a Beta variational family for bounded parameters
  • Select an appropriate variational family for a given problem

Prerequisites

Complete Optimizing the ELBO before this tutorial.

The Tradeoff

The previous tutorial showed that a diagonal Gaussian captures the marginal posterior means and variances but misses correlations, and that a full-covariance Gaussian can recover them. A natural question: why not always use the most expressive family possible?

The answer is cost. A diagonal Gaussian over \(d\) latent variables has \(2d\) parameters (\(d\) means and \(d\) log-standard-deviations). A full-covariance Gaussian has \(d + d(d+1)/2\) parameters (means plus a lower-triangular Cholesky factor). For \(d = 100\), that is 200 vs. 5,150 parameters. The richer family captures more structure but requires more optimization effort and more base samples.
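The parameter counts above follow directly from the two parameterizations and can be tabulated with a few lines of plain Python (a standalone sketch, independent of PyApprox):

```python
def n_params_diagonal(d):
    # d means + d log-standard-deviations
    return 2 * d


def n_params_full(d):
    # d means + d(d+1)/2 entries of the lower-triangular Cholesky factor
    return d + d * (d + 1) // 2


for d in (2, 10, 100):
    print(f"d={d:>3}: diagonal {n_params_diagonal(d):>4}, full {n_params_full(d):>5}")
# d=100 gives 200 vs. 5150, the numbers quoted above
```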

The right choice depends on what matters in the posterior: if the parameters are nearly independent, mean-field is sufficient; if the data induces strong correlations, capturing them may be essential for accurate uncertainty quantification.

Setup

We use the 2D composite beam from the previous tutorial: a cantilever beam with uncertain skin modulus \(E_1\) and core modulus \(E_2\). With two correlated parameters, the limitations of each family are clearly visible.

import math

import numpy as np

from pyapprox.util.backends.numpy import NumpyBkd
from pyapprox.benchmarks.functions.algebraic.cantilever_beam import (
    CantileverBeam1DAnalytical,
)
from pyapprox.probability.likelihood import (
    DiagonalGaussianLogLikelihood,
    ModelBasedLogLikelihood,
)
from pyapprox.probability.likelihood.gaussian import (
    MultiExperimentLogLikelihood,
)
from pyapprox.interface.functions.fromcallable.function import (
    FunctionFromCallable,
)
from pyapprox.inverse.posterior.log_unnormalized import (
    LogUnNormalizedPosterior,
)
from pyapprox.probability.joint.independent import IndependentJoint
from pyapprox.probability.univariate.gaussian import GaussianMarginal

bkd = NumpyBkd()

# Composite beam: (E1_skin, E2_core) -> tip_deflection
L, H, q0 = 100.0, 30.0, 10.0
composite_beam = CantileverBeam1DAnalytical(
    length=L, height=H, skin_thickness=5.0, q0=q0, bkd=bkd,
)

tip_model_2d = FunctionFromCallable(
    nqoi=1, nvars=2,
    fun=lambda E: composite_beam(E)[0:1, :],
    bkd=bkd,
)

# Independent Gaussian priors on E1 and E2
mu_prior_2d = np.array([12_000.0, 10_000.0])
sigma_prior_2d = np.array([2_000.0, 2_000.0])

prior_2d = IndependentJoint(
    [GaussianMarginal(mean=mu_prior_2d[i], stdev=sigma_prior_2d[i], bkd=bkd)
     for i in range(2)],
    bkd,
)

# Standardized model: z -> tip_deflection
std_tip_model_2d = FunctionFromCallable(
    nqoi=1, nvars=2,
    fun=lambda z: tip_model_2d(
        bkd.asarray(mu_prior_2d[:, None])
        + bkd.asarray(sigma_prior_2d[:, None]) * z
    ),
    bkd=bkd,
)

# True parameters and observation
E1_true, E2_true = 10_000.0, 8_000.0
E_true_2d = bkd.asarray([[E1_true], [E2_true]])
sigma_noise_2d = 0.4
noise_lik_2d = DiagonalGaussianLogLikelihood(
    bkd.asarray([sigma_noise_2d**2]), bkd,
)
model_lik_2d = ModelBasedLogLikelihood(tip_model_2d, noise_lik_2d, bkd)

np.random.seed(7)
y_obs_2d = float(model_lik_2d.rvs(E_true_2d)[0, 0])
noise_lik_2d.set_observations(bkd.asarray([[y_obs_2d]]))

# Exact posterior on grid for comparison
posterior_fn = LogUnNormalizedPosterior(
    model_fn=tip_model_2d, likelihood=noise_lik_2d, prior=prior_2d, bkd=bkd,
)

n_g = 80
E1_g = np.linspace(4_000, 20_000, n_g)
E2_g = np.linspace(2_000, 18_000, n_g)
E1_m, E2_m = np.meshgrid(E1_g, E2_g)
grid_pts = bkd.asarray(np.vstack([E1_m.ravel(), E2_m.ravel()]))
lp_flat = bkd.to_numpy(posterior_fn(grid_pts))
lp_grid = lp_flat.reshape(n_g, n_g)
p_grid = np.exp(lp_grid - np.max(lp_grid))

print(f"True parameters: E1 = {E1_true:.0f}, E2 = {E2_true:.0f}")
print(f"Observed tip deflection: {y_obs_2d:.6f}")
True parameters: E1 = 10000, E2 = 8000
Observed tip deflection: 5.377065

Family 1: Diagonal (Mean-Field) Gaussian

The mean-field family assumes independence between all latent variables:

\[ q(E_1, E_2) = q_1(E_1) \cdot q_2(E_2) = \mathcal{N}(E_1;\, \mu_1, \sigma_1^2) \cdot \mathcal{N}(E_2;\, \mu_2, \sigma_2^2) \]

This is what ConditionalIndependentJoint with per-dimension ConditionalGaussian components provides. It has 4 parameters: \((\mu_1, \log\sigma_1, \mu_2, \log\sigma_2)\).
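To make the factorization concrete, here is a minimal NumPy/SciPy sketch (independent of the PyApprox classes) showing that the mean-field log-density is just a sum of 1D Gaussian log-densities, and agrees with a multivariate normal whose covariance is diagonal:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm


def mean_field_logpdf(x, mu, log_sigma):
    # q(x) = prod_i N(x_i; mu_i, sigma_i^2)  =>  log q(x) = sum of 1D terms
    return norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum(axis=-1)


mu = np.array([0.5, -1.0])
log_sigma = np.array([0.0, 0.3])
x = np.array([0.2, -0.7])

# the same density written as a 2D Gaussian with diagonal covariance
full = multivariate_normal(mu, np.diag(np.exp(2 * log_sigma))).logpdf(x)
assert np.isclose(mean_field_logpdf(x, mu, log_sigma), full)
```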

from pyapprox.probability.conditional.gaussian import ConditionalGaussian
from pyapprox.probability.conditional.joint import ConditionalIndependentJoint
from pyapprox.probability.univariate import UniformMarginal
from pyapprox.surrogates.affine.basis import OrthonormalPolynomialBasis
from pyapprox.surrogates.affine.expansions import BasisExpansion
from pyapprox.surrogates.affine.indices import compute_hyperbolic_indices
from pyapprox.surrogates.affine.univariate import create_bases_1d
from pyapprox.inverse.variational.elbo import make_single_problem_elbo
from pyapprox.inverse.variational.fitter import VariationalFitter
from pyapprox.optimization.minimize.scipy.trust_constr import (
    ScipyTrustConstrOptimizer,
)


def _make_degree0_expansion(bkd, coeff=0.0):
    marginals = [UniformMarginal(-1.0, 1.0, bkd)]
    bases_1d = create_bases_1d(marginals, bkd)
    indices = compute_hyperbolic_indices(1, 0, 1.0, bkd)
    basis = OrthonormalPolynomialBasis(bases_1d, bkd, indices)
    exp = BasisExpansion(basis, bkd, nqoi=1)
    exp.set_coefficients(bkd.asarray([[coeff]]))
    return exp


def build_diagonal_gaussian(bkd, ndim):
    """Build a diagonal (mean-field) Gaussian variational distribution."""
    conds = []
    for _ in range(ndim):
        mf = _make_degree0_expansion(bkd, 0.0)
        lsf = _make_degree0_expansion(bkd, 0.0)
        conds.append(ConditionalGaussian(mf, lsf, bkd))
    var_dist = ConditionalIndependentJoint(conds, bkd)
    return var_dist, conds


# Build diagonal Gaussian VI (in standardized space)
var_dist_diag, conds_diag = build_diagonal_gaussian(bkd, 2)
vi_prior = IndependentJoint(
    [GaussianMarginal(0.0, 1.0, bkd) for _ in range(2)], bkd,
)

base_lik = DiagonalGaussianLogLikelihood(
    bkd.asarray([sigma_noise_2d**2]), bkd,
)
multi_lik = MultiExperimentLogLikelihood(
    base_lik, bkd.asarray([[y_obs_2d]]), bkd,
)

def log_lik_2d(z):
    return multi_lik.logpdf(std_tip_model_2d(z))

nsamples = 300
np.random.seed(42)
base_samples = bkd.asarray(np.random.normal(0, 1, (2, nsamples)))
weights = bkd.full((1, nsamples), 1.0 / nsamples)

elbo_diag = make_single_problem_elbo(
    var_dist_diag, log_lik_2d, vi_prior, base_samples, weights, bkd,
)

opt = ScipyTrustConstrOptimizer(maxiter=100, gtol=1e-8, verbosity=0)
fitter = VariationalFitter(bkd, optimizer=opt)
result_diag = fitter.fit(elbo_diag)

print(f"Diagonal Gaussian: {elbo_diag.nvars()} parameters")
print(f"Negative ELBO: {result_diag.neg_elbo():.4f}")
Diagonal Gaussian: 4 parameters
Negative ELBO: 2.9044

Family 2: Full-Covariance Gaussian

To capture correlations, we replace the diagonal covariance with a full covariance matrix. The covariance must be symmetric positive-definite, which we enforce via the Cholesky parameterization: store the lower-triangular factor \(L\) such that \(\Sigma = L L^\top\). Specifically, we parameterize the log of the diagonal entries to ensure positivity:

\[ L = \begin{pmatrix} e^{\ell_{11}} & 0 \\ \ell_{21} & e^{\ell_{22}} \end{pmatrix}, \qquad \Sigma = L L^\top \]

This is exactly what ConditionalDenseCholGaussian implements. For \(d = 2\), the parameterization has \(2 + 2 + 1 = 5\) parameters (2 means, 2 log-diagonal entries, 1 off-diagonal entry).
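The parameterization itself is easy to reproduce in NumPy (a sketch, not the PyApprox implementation): exponentiating the diagonal of \(L\) guarantees \(\Sigma = LL^\top\) is symmetric positive-definite for any unconstrained parameter values:

```python
import numpy as np


def cov_from_chol_params(log_diag, offdiag):
    """Build Sigma = L L^T from unconstrained parameters."""
    d = len(log_diag)
    L = np.zeros((d, d))
    L[np.diag_indices(d)] = np.exp(log_diag)  # positive diagonal
    L[np.tril_indices(d, k=-1)] = offdiag     # unconstrained lower triangle
    return L @ L.T


Sigma = cov_from_chol_params(log_diag=[0.1, -0.2], offdiag=[0.5])
assert np.allclose(Sigma, Sigma.T)            # symmetric
assert np.all(np.linalg.eigvalsh(Sigma) > 0)  # positive-definite
```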

from pyapprox.probability.conditional.multivariate_gaussian import (
    ConditionalDenseCholGaussian,
)
from pyapprox.probability.gaussian.dense import (
    DenseCholeskyMultivariateGaussian,
)
from pyapprox.surrogates.affine.expansions.pce import (
    create_pce_from_marginals,
)


def _make_expansion_nd(bkd, nvars_in, degree, nqoi, coeff=0.0):
    """Create a BasisExpansion with given degree, input dim, and nqoi."""
    marginals = [UniformMarginal(-1.0, 1.0, bkd) for _ in range(nvars_in)]
    exp = create_pce_from_marginals(marginals, degree, bkd, nqoi=nqoi)
    nterms = exp.nterms()
    coeffs = np.zeros((nterms, nqoi))
    coeffs[0, :] = coeff
    exp.set_coefficients(bkd.asarray(coeffs))
    return exp


def build_dense_chol_gaussian(bkd, d, nvars_in=1, degree=0):
    """Build a full-covariance Gaussian via Cholesky parameterization."""
    mean_func = _make_expansion_nd(bkd, nvars_in, degree, nqoi=d)
    log_chol_diag_func = _make_expansion_nd(bkd, nvars_in, degree, nqoi=d)
    n_offdiag = d * (d - 1) // 2
    chol_offdiag_func = None
    if d > 1:
        chol_offdiag_func = _make_expansion_nd(
            bkd, nvars_in, degree, nqoi=n_offdiag,
        )
    return ConditionalDenseCholGaussian(
        mean_func, log_chol_diag_func, chol_offdiag_func, bkd,
    )


var_dist_full = build_dense_chol_gaussian(bkd, d=2)
prior_full = DenseCholeskyMultivariateGaussian(
    bkd.zeros((2, 1)), bkd.eye(2), bkd,
)

elbo_full = make_single_problem_elbo(
    var_dist_full, log_lik_2d, prior_full, base_samples, weights, bkd,
)

fitter_full = VariationalFitter(bkd, optimizer=opt)
result_full = fitter_full.fit(elbo_full)

print(f"Full-covariance Gaussian: {elbo_full.nvars()} parameters")
print(f"Negative ELBO: {result_full.neg_elbo():.4f}")
Full-covariance Gaussian: 5 parameters
Negative ELBO: 2.6823

Figure 1 compares the two Gaussian families on the 2D composite beam posterior.

Figure 1: Diagonal vs. full-covariance Gaussian VI on the 2D composite beam posterior. Left: the diagonal (mean-field) Gaussian has axis-aligned ellipses — it cannot represent the posterior’s tilt. Right: the full-covariance Gaussian tilts its ellipses to match the posterior correlation, producing a substantially better fit and a lower negative ELBO.

Family 3: Low-Rank Covariance

The full Cholesky parameterization stores \(d(d+1)/2\) entries. For large \(d\), this becomes expensive. The low-rank parameterization approximates the covariance as:

\[ \Sigma = D^2 + V V^\top \]

where \(D = \mathrm{diag}(e^{\ell_1}, \ldots, e^{\ell_d})\) is a diagonal matrix and \(V\) is a \(d \times r\) factor with \(r \ll d\). This captures the \(r\) most important correlation directions using only \(d + dr\) covariance parameters instead of \(d(d+1)/2\) (plus \(d\) means in either case).

In PyApprox, ConditionalLowRankCholGaussian implements this parameterization. For the 2D problem (\(d = 2\), \(r = 2\)), low-rank and full-covariance have the same capacity; the benefit appears for \(d \gg 2\).
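A plain NumPy sketch (illustrative, not the PyApprox implementation) shows why this form is attractive: \(\Sigma = D^2 + VV^\top\) is positive-definite by construction, and for moderate \(d\) the covariance parameter count is already far smaller than the full Cholesky factor's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 50, 3  # illustrative sizes

log_diag = 0.1 * rng.normal(size=d)
V = 0.3 * rng.normal(size=(d, r))
D2 = np.exp(2 * log_diag)                # diagonal of D^2, positive by construction

# positive-definite: PD diagonal plus positive-semidefinite V V^T
Sigma = np.diag(D2) + V @ V.T
assert np.all(np.linalg.eigvalsh(Sigma) > 0)

# covariance parameter counts: low-rank vs. full Cholesky
print(d + d * r, "vs", d * (d + 1) // 2)  # 200 vs 1275
```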

from pyapprox.probability.conditional.multivariate_gaussian import (
    ConditionalLowRankCholGaussian,
)


def build_low_rank_gaussian(bkd, d, rank, nvars_in=1, degree=0):
    """Build a low-rank covariance Gaussian."""
    mean_func = _make_expansion_nd(bkd, nvars_in, degree, nqoi=d)
    log_diag_func = _make_expansion_nd(bkd, nvars_in, degree, nqoi=d)
    # Set lower bounds on log_diag to prevent numerical singularity
    for hp in log_diag_func.hyp_list().hyperparameters():
        bounds = hp.get_bounds()
        lower = bkd.full((bounds.shape[0],), -6.0)
        new_bounds = bkd.stack([lower, bounds[:, 1]], axis=1)
        hp.set_bounds(new_bounds)
    factor_func = None
    if rank > 0:
        factor_func = _make_expansion_nd(
            bkd, nvars_in, degree, nqoi=d * rank,
        )
    return ConditionalLowRankCholGaussian(
        mean_func, log_diag_func, factor_func, rank, bkd,
    )

var_dist_lr = build_low_rank_gaussian(bkd, d=2, rank=2)

elbo_lr = make_single_problem_elbo(
    var_dist_lr, log_lik_2d, prior_full, base_samples, weights, bkd,
)
fitter_lr = VariationalFitter(bkd, optimizer=opt)
result_lr = fitter_lr.fit(elbo_lr)

print(f"Low-rank Gaussian (rank=2): {elbo_lr.nvars()} parameters")
print(f"Negative ELBO: {result_lr.neg_elbo():.4f}")
Low-rank Gaussian (rank=2): 8 parameters
Negative ELBO: 2.6823
Tip: When to use low-rank

The low-rank parameterization is most useful when \(d\) is large (tens to hundreds) and the posterior correlation structure is approximately low-dimensional — for example, when observations constrain a few linear combinations of the parameters much more tightly than others. For \(d = 2\) or \(d = 3\), the full Cholesky parameterization is usually preferable.

Family 4: Beta for Bounded Parameters

Not all parameters are unbounded. For a probability \(p \in (0, 1)\), a Gaussian variational distribution is a poor choice because it places mass outside the valid range. A Beta variational family naturally respects the bounds.
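The leakage is easy to quantify with SciPy. The sketch below (a hypothetical example, not part of the tutorial's inference problem) moment-matches a Gaussian to a skewed Beta(2, 8) and measures how much probability mass lands outside \((0, 1)\):

```python
from scipy.stats import norm

# moment-match a Gaussian to Beta(2, 8) (illustrative, skewed toward 0)
a, b = 2.0, 8.0
mean = a / (a + b)                           # 0.2
var = a * b / ((a + b) ** 2 * (a + b + 1))   # ~0.0145
g = norm(mean, var ** 0.5)

leaked = g.cdf(0.0) + g.sf(1.0)  # mass below 0 plus mass above 1
print(f"Gaussian mass outside (0, 1): {leaked:.4f}")  # roughly 5%
```

The closer the posterior concentrates near a boundary, the worse this leakage becomes, whereas a Beta family assigns zero mass outside \((0, 1)\) by construction.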

To illustrate, consider the classic Beta-Bernoulli problem: we observe coin flips and want to infer the probability of heads \(p\). The prior is \(p \sim \text{Beta}(2, 2)\) (mildly favoring fairness), and we observe \([1, 1, 0, 1, 0]\) (3 heads, 2 tails). The exact conjugate posterior is \(\text{Beta}(5, 4)\).

from pyapprox.probability.conditional.beta import ConditionalBeta
from pyapprox.probability.univariate.beta import BetaMarginal
from pyapprox.inverse.conjugate.beta import BetaConjugatePosterior

# Exact conjugate posterior
prior_alpha, prior_beta = 2.0, 2.0
obs_list = [1.0, 1.0, 0.0, 1.0, 0.0]

conjugate = BetaConjugatePosterior(prior_alpha, prior_beta, bkd)
conjugate.compute(bkd.asarray([obs_list]))
exact_posterior_alpha = prior_alpha + sum(obs_list)
exact_posterior_beta = prior_beta + len(obs_list) - sum(obs_list)

print(f"Exact posterior: Beta({exact_posterior_alpha}, {exact_posterior_beta})")
print(f"Exact posterior mean: {exact_posterior_alpha / (exact_posterior_alpha + exact_posterior_beta):.4f}")
Exact posterior: Beta(5.0, 4.0)
Exact posterior mean: 0.5556
# VI with Beta variational family
log_alpha_func = _make_degree0_expansion(bkd, math.log(prior_alpha))
log_beta_func = _make_degree0_expansion(bkd, math.log(prior_beta))
cond_beta = ConditionalBeta(log_alpha_func, log_beta_func, bkd)

prior_marginal = BetaMarginal(prior_alpha, prior_beta, bkd)
obs = bkd.asarray([obs_list])

def bernoulli_log_lik(z):
    """Bernoulli log-likelihood: sum_i obs_i*log(p) + (1-obs_i)*log(1-p)."""
    p = bkd.clip(z, 1e-8, 1.0 - 1e-8)
    n_success = bkd.sum(obs)
    n_fail = obs.shape[1] - n_success
    return n_success * bkd.log(p) + n_fail * bkd.log(1.0 - p)

np.random.seed(42)
base_samples_beta = bkd.asarray(np.random.uniform(0.01, 0.99, (1, 500)))
weights_beta = bkd.full((1, 500), 1.0 / 500)

elbo_beta = make_single_problem_elbo(
    cond_beta, bernoulli_log_lik, prior_marginal,
    base_samples_beta, weights_beta, bkd,
)

fitter_beta = VariationalFitter(bkd)
result_beta = fitter_beta.fit(elbo_beta)

# Extract recovered parameters
dummy_x = bkd.zeros((cond_beta.nvars(), 1))
rec_alpha = math.exp(float(cond_beta._log_alpha_func(dummy_x)[0, 0]))
rec_beta = math.exp(float(cond_beta._log_beta_func(dummy_x)[0, 0]))
rec_mean = rec_alpha / (rec_alpha + rec_beta)

print(f"VI posterior: Beta({rec_alpha:.2f}, {rec_beta:.2f})")
print(f"VI posterior mean: {rec_mean:.4f}")
VI posterior: Beta(4.86, 3.88)
VI posterior mean: 0.5564

Figure 2 shows the VI Beta approximation compared to the exact conjugate posterior.

Figure 2: Beta VI on the coin-flip problem. The VI approximation (blue dashed) closely matches the exact conjugate posterior (orange). The Beta variational family naturally respects the parameter bounds \(p \in (0, 1)\).

Choosing a Family: Summary

The following guide summarizes when to use each family:

| Family | Parameters | Captures | Best for |
| --- | --- | --- | --- |
| Diagonal Gaussian | \(2d\) | Marginals only | Independent or weakly correlated parameters |
| Full-covariance Gaussian | \(d + d(d{+}1)/2\) | Full correlation | Moderate \(d\) with strong correlations |
| Low-rank Gaussian | \(2d + dr\) | Top-\(r\) correlations | Large \(d\) with low-rank correlation structure |
| Beta | \(2\) per dim | Bounded, possibly skewed | Probability parameters or bounded quantities |

When in doubt, start with the diagonal Gaussian for a quick baseline, then upgrade to full-covariance if the negative ELBO improves substantially.

Important: Validating the approximation

A lower negative ELBO means a better approximation (closer to the true posterior). Use the ELBO to compare families: if switching from diagonal to full-covariance reduces the negative ELBO significantly, the correlation structure matters and the richer family is worthwhile.
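In the Gaussian case, the ELBO gap between families has a closed form. The sketch below (plain NumPy, assuming a unit-variance bivariate Gaussian target; a standalone illustration, not PyApprox code) computes the irreducible KL divergence of the best mean-field fit, which equals the ELBO gap between diagonal and full-covariance and grows as \(-\tfrac{1}{2}\log(1-\rho^2)\) with the correlation \(\rho\):

```python
import numpy as np


def kl_gaussians(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) )."""
    d = len(mu0)
    S1inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1inv @ S0) + diff @ S1inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))


mu = np.zeros(2)
for rho in (0.0, 0.5, 0.9):
    S_p = np.array([[1.0, rho], [rho, 1.0]])  # target covariance
    # optimal mean-field q matches the diagonal of the target's precision,
    # giving per-dimension variance 1 - rho^2
    S_q = (1.0 - rho ** 2) * np.eye(2)
    gap = kl_gaussians(mu, S_q, mu, S_p)
    print(f"rho={rho:.1f}  ELBO gap = {gap:.4f}")  # equals -0.5*log(1 - rho^2)
```

At \(\rho = 0\) the gap vanishes (mean-field is exact); at \(\rho = 0.9\) it is about 0.83 nats, which is why strongly correlated posteriors show a clear ELBO improvement when moving to full covariance.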

Key Takeaways

  • The diagonal (mean-field) Gaussian is fast but ignores correlations; use it as a baseline or when parameters are approximately independent
  • The full-covariance Gaussian via Cholesky parameterization captures all pairwise correlations; the Cholesky factor ensures positive-definiteness by construction
  • The low-rank covariance parameterization (\(\Sigma = D^2 + VV^\top\)) scales to high dimensions by capturing only the top-\(r\) correlation directions
  • A Beta variational family naturally handles bounded parameters and can recover conjugate posteriors for Bernoulli likelihoods
  • Compare families using the negative ELBO: a significant decrease indicates the richer family captures important posterior structure

Exercises

  1. On the 2D composite beam, compare the negative ELBO across all three Gaussian families (diagonal, full-covariance, low-rank with \(r = 1\)). How much does full-covariance improve over diagonal? How does low-rank with \(r = 1\) compare?

  2. Change the noise standard deviation in the 2D beam to \(\sigma = 2.0\) (5x larger). Does the posterior become less correlated? Does the gap between diagonal and full-covariance ELBO shrink?

  3. For the Beta-Bernoulli problem, change the observations to \([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]\) (8 heads, 2 tails). Verify that VI still recovers the correct conjugate posterior \(\text{Beta}(10, 4)\).

  4. (Challenge) Extend the composite beam to 3 uncertain moduli by adding a third layer. Compare the number of parameters and optimization time for diagonal vs. full-covariance vs. low-rank (\(r = 1\)) Gaussians. At what dimension does low-rank become advantageous?

Next Steps

Continue with: