Variational Inference

Notations

  • Observed data: \(\mathcal{D} =\{(x_i,y_i)\vert i=0,1,\ldots,d-1\}\)
  • Model parameters: \(\boldsymbol{c}=\{c_0,c_1,\ldots,c_{n-1}\}\)
  • Likelihood & prior densities: \(p(\mathcal{D}\vert\boldsymbol{c})\) & \(p(\boldsymbol{c})\)
  • Posterior & approximate posterior densities: \(p(\boldsymbol{c}\vert\mathcal{D})\) & \(q_{\boldsymbol{\theta}}(\boldsymbol{c})\)

In a Bayesian framework, the posterior distribution of the parameters \(\boldsymbol{c}\) of a model \(M\) can be expressed as $$ p(\boldsymbol{c}\vert\mathcal{D}) \propto p(\mathcal{D}\vert\boldsymbol{c}) p(\boldsymbol{c}) $$ For generic models there is no closed-form solution for \(p(\boldsymbol{c}\vert\mathcal{D})\). Sampling the posterior distribution \(p(\boldsymbol{c}\vert\mathcal{D})\) is typically hampered by the high dimensionality of \(\boldsymbol{c}\) and by the high computational cost of evaluating the likelihood \(p(\mathcal{D}\vert\boldsymbol{c})\) for the large number of model evaluations required by, for example, Markov chain Monte Carlo schemes.

To work around this challenge, instead of sampling or evaluating \(p(\boldsymbol{c}\vert\mathcal{D})\), we will approximate it with a simpler distribution \(q_{\boldsymbol{\theta}}(\boldsymbol{c})\): $$ p(\boldsymbol{c}\vert\mathcal{D}) \approx q_{\boldsymbol{\theta}}(\boldsymbol{c}) $$ where \(\boldsymbol{\theta}\) are variational parameters to be optimized by minimizing the distance between the "true" and approximate distributions. The algorithm works as follows:

  • Choose a tractable family of distributions \(\mathcal{Q} = \{q_\boldsymbol{\theta} : \boldsymbol{\theta} \in \Theta\}\)
  • Use KL divergence as the measure of closeness, i.e.

$$ \text{KL}(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c}\vert\mathcal{D})) = \int_{\boldsymbol{c}} q_{\boldsymbol{\theta}}(\boldsymbol{c}) \log\frac{q_{\boldsymbol{\theta}}(\boldsymbol{c})}{p(\boldsymbol{c}\vert\mathcal{D})}\,d\boldsymbol{c} $$ to find \(\boldsymbol{\theta}^* \in \Theta\) such that \(q_{\boldsymbol{\theta}^*}(\boldsymbol{c})\) is closest to \(p(\boldsymbol{c}\vert\mathcal{D})\) among all possible choices for \(\boldsymbol{\theta}\).

The optimization algorithm uses the following Evidence Lower BOund (ELBO) expression for the loss function

\[ \mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{q}[\log p(\mathcal{D}\vert\boldsymbol{c})] - \text{KL}(q_{\boldsymbol{\theta}}(\boldsymbol{c})\parallel p(\boldsymbol{c})) + \text{const.} \]
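This expression follows from Bayes' rule: writing \(p(\boldsymbol{c}\vert\mathcal{D}) = p(\mathcal{D}\vert\boldsymbol{c})\,p(\boldsymbol{c})/p(\mathcal{D})\) inside the KL divergence above gives

\[ \text{KL}(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c}\vert\mathcal{D})) = \log p(\mathcal{D}) - \mathbb{E}_{q}[\log p(\mathcal{D}\vert\boldsymbol{c})] + \text{KL}(q_{\boldsymbol{\theta}}(\boldsymbol{c})\parallel p(\boldsymbol{c})), \]

so the term denoted const. absorbs the evidence \(\log p(\mathcal{D})\), which does not depend on \(\boldsymbol{\theta}\); maximizing \(\mathcal{L}(\boldsymbol{\theta})\) is therefore equivalent to minimizing the KL divergence to the true posterior.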

The current version of the library implements variational inference algorithms that assume a Gaussian family \(\mathcal{Q}\) of approximate posteriors, as well as a Gaussian likelihood. For the prior distribution \(p(\boldsymbol{c})\), both uniform and Gaussian distributions are implemented.

For the case where both the approximate posterior and the prior are multivariate Gaussian distributions

\[ q_{\boldsymbol{\theta}}(\boldsymbol{c})=\mathcal{N}(\boldsymbol{c} \mid \boldsymbol{\mu}_\boldsymbol{\theta},\boldsymbol{\Sigma}_{\boldsymbol{\theta}}),\,\, p(\boldsymbol{c}) = \mathcal{N}(\boldsymbol{c} \mid \boldsymbol{\mu}_{p}, \boldsymbol{\Sigma}_{p}), \]

the KL divergence can be computed analytically as

\[ \text{KL}(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c})) = \frac{1}{2}\left[\text{tr}(\boldsymbol{\Sigma}_p^{-1}\boldsymbol{\Sigma}_{\boldsymbol{\theta}}) + (\boldsymbol{\mu}_p - \boldsymbol{\mu}_{\boldsymbol{\theta}})^\top\boldsymbol{\Sigma}_p^{-1}(\boldsymbol{\mu}_p - \boldsymbol{\mu}_{\boldsymbol{\theta}}) - n + \log\frac{|\boldsymbol{\Sigma}_p|}{|\boldsymbol{\Sigma}_{\boldsymbol{\theta}}|}\right] \]
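As an illustration, this closed-form expression can be evaluated with a few lines of NumPy; the sketch below uses our own function and variable names, not the library's API.

```python
import numpy as np

def kl_gauss_gauss(mu_q, Sigma_q, mu_p, Sigma_p):
    """KL( N(mu_q, Sigma_q) || N(mu_p, Sigma_p) ) for full covariance matrices."""
    n = mu_q.size
    Sigma_p_inv = np.linalg.inv(Sigma_p)
    diff = mu_p - mu_q
    # slogdet is numerically safer than log(det) for the log-determinant ratio
    _, logdet_p = np.linalg.slogdet(Sigma_p)
    _, logdet_q = np.linalg.slogdet(Sigma_q)
    return 0.5 * (np.trace(Sigma_p_inv @ Sigma_q)
                  + diff @ Sigma_p_inv @ diff
                  - n
                  + logdet_p - logdet_q)

# Example: KL between N([0.1, -0.2], diag(0.5, 0.8)) and a standard-normal prior
print(kl_gauss_gauss(np.array([0.1, -0.2]), np.diag([0.5, 0.8]),
                     np.zeros(2), np.eye(2)))
```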

If the prior is uniform, the above expression reduces to

\[ \text{KL}(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c}))=- \frac{1}{2}\log|\boldsymbol{\Sigma}_{\boldsymbol{\theta}}|+\text{const.} \]
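This follows because a uniform prior contributes only a constant to \(\mathbb{E}_{q}[\log p(\boldsymbol{c})]\), so the KL divergence reduces to the negative entropy of the Gaussian \(q_{\boldsymbol{\theta}}\) up to additive constants:

\[ \text{KL}(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c})) = \mathbb{E}_{q}[\log q_{\boldsymbol{\theta}}(\boldsymbol{c})] - \mathbb{E}_{q}[\log p(\boldsymbol{c})] = -\frac{1}{2}\log|\boldsymbol{\Sigma}_{\boldsymbol{\theta}}| - \frac{n}{2}\log(2\pi e) + \text{const.} \]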

The first term on the right-hand side of the ELBO is estimated with a Monte Carlo average,

\[ \mathbb{E}_{q}[\log p(\mathcal{D}\vert\boldsymbol{c})]\approx \frac{1}{N_s}\sum_{k=1}^{N_s}\log p(\mathcal{D}\vert\boldsymbol{c}_k) \]

with samples \(\boldsymbol{c}_k\) drawn from \(\mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{\theta}},\boldsymbol{\Sigma}_{\boldsymbol{\theta}})\). Using the reparametrization trick, each sample is written as

\[ \boldsymbol{c} = \boldsymbol{\mu}_{\boldsymbol{\theta}} + L\boldsymbol{\epsilon} \]

where \(\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\mathbb{I})\) and \(L\) is the lower-triangular matrix resulting from the Cholesky decomposition of \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\), i.e. \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}=L L^\top\). The expected log-likelihood expression above becomes

\[ \mathbb{E}_{q}[\log p(\mathcal{D}\vert\boldsymbol{c})]=\mathbb{E}_{\boldsymbol{\epsilon}}[\log p(\mathcal{D}\vert\boldsymbol{\mu}_{\boldsymbol{\theta}} + L\boldsymbol{\epsilon})]\approx \frac{1}{N_s}\sum_{k=1}^{N_s}\log p(\mathcal{D}\vert\boldsymbol{\mu}_{\boldsymbol{\theta}} + L\boldsymbol{\epsilon}_k). \]
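For illustration, a minimal NumPy sketch of this estimator is given below; it assumes a Gaussian likelihood with known noise standard deviation `sigma` and a generic forward model `model(c, x)`, both of which are hypothetical names rather than part of the library.

```python
import numpy as np

def expected_log_likelihood(mu, L, x, y, model, sigma, n_samples=64, rng=None):
    """Monte Carlo estimate of E_q[log p(D|c)] using c = mu + L @ eps, eps ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.size)   # eps ~ N(0, I)
        c = mu + L @ eps                     # reparametrized sample from q_theta
        resid = y - model(c, x)              # residuals of the forward model
        # Gaussian log-likelihood with known noise std sigma (additive constants dropped)
        total += -0.5 * np.sum(resid**2) / sigma**2
    return total / n_samples
```

With, for example, `model = lambda c, x: np.polyval(c, x)`, this yields the expected log-likelihood term of the ELBO for a polynomial regression model.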

Gradients with respect to the components of \(\boldsymbol{\theta}\), namely \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\), are estimated with automatic differentiation tools, provided the model that depends on the parameters \(\boldsymbol{c}\) is differentiable.
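As a sketch of how this could look with PyTorch as the automatic differentiation tool (our own illustrative setup, not the library's implementation), the gradients of the negative ELBO with respect to \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and the Cholesky factor \(L\) are obtained by backpropagation through the reparametrized samples; a linear forward model with a standard-normal prior is assumed here.

```python
import torch

# Hypothetical example: linear model y = X c with Gaussian noise, standard-normal prior p(c)
torch.manual_seed(0)
n, d, sigma = 2, 20, 0.1
X = torch.randn(d, n)
y = X @ torch.tensor([1.0, -0.5]) + sigma * torch.randn(d)

mu = torch.zeros(n, requires_grad=True)        # variational mean mu_theta
L_raw = torch.eye(n).requires_grad_(True)      # unconstrained parameters of the Cholesky factor

opt = torch.optim.Adam([mu, L_raw], lr=1e-2)
for step in range(2000):
    L = torch.tril(L_raw)                      # keep L lower triangular
    eps = torch.randn(64, n)                   # eps ~ N(0, I), N_s = 64 samples
    c = mu + eps @ L.T                         # reparametrized samples c_k ~ q_theta
    resid = y - c @ X.T                        # residuals for each sample
    exp_loglik = (-0.5 * (resid**2).sum(dim=1) / sigma**2).mean()
    # analytic KL(q || p) for p(c) = N(0, I); log|Sigma_theta| = 2 * sum(log|diag(L)|)
    logdet = 2.0 * torch.log(torch.diagonal(L).abs()).sum()
    kl = 0.5 * (torch.trace(L @ L.T) + mu @ mu - n - logdet)
    loss = -(exp_loglik - kl)                  # negative ELBO
    opt.zero_grad()
    loss.backward()                            # autodiff gradients w.r.t. mu and L_raw
    opt.step()
```

After optimization, `mu` and `torch.tril(L_raw) @ torch.tril(L_raw).T` give the mean and covariance of the Gaussian approximate posterior in this sketch.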