Variational Inference
Notations
- Observed data: \(\mathcal{D} = \{(x_i,y_i)\vert i=0,1,\ldots,d-1\}\)
- Model parameters: \(\boldsymbol{c}=\{c_0,c_1,\ldots,c_{n-1}\}\)
- Likelihood & prior densities: \(p(\mathcal{D}\vert\boldsymbol{c})\) & \(p(\boldsymbol{c})\)
- Posterior & approximate posterior densities: \(p(\boldsymbol{c}\vert\mathcal{D})\) & \(q_{\boldsymbol{\theta}}(\boldsymbol{c})\)
In a Bayesian framework, the posterior distribution of the parameters \(\boldsymbol{c}\) of a model \(M\) can be estimated as $$ p(\boldsymbol{c}\vert\mathcal{D}) \propto p(\mathcal{D}\vert\boldsymbol{c})\, p(\boldsymbol{c}) $$ For generic models there is no closed-form solution for \(p(\boldsymbol{c}\vert\mathcal{D})\). Sampling from \(p(\boldsymbol{c}\vert\mathcal{D})\) is typically hampered by the high dimensionality of \(\boldsymbol{c}\) and by the high computational cost of evaluating the likelihood \(p(\mathcal{D}\vert\boldsymbol{c})\) for the large number of model evaluations required by, for example, Markov chain Monte Carlo schemes.
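To make the notation concrete, the following is a minimal NumPy sketch of the unnormalized log-posterior \(\log p(\mathcal{D}\vert\boldsymbol{c}) + \log p(\boldsymbol{c})\) for a hypothetical linear model with a Gaussian likelihood and a Gaussian prior; the model form, synthetic data, noise level, prior scale, and function names are illustrative assumptions, not part of the library.

```python
import numpy as np

# Hypothetical example: polynomial model y = sum_j c_j x^j with Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)   # synthetic data D

def log_likelihood(c, sigma=0.1):
    """log p(D | c) for a Gaussian likelihood with known noise std sigma."""
    pred = np.polyval(c[::-1], x)            # model evaluation M(x; c)
    resid = y - pred
    return -0.5 * np.sum(resid**2) / sigma**2 - x.size * np.log(sigma * np.sqrt(2.0 * np.pi))

def log_prior(c, scale=10.0):
    """log p(c) for an isotropic Gaussian prior N(0, scale^2 I)."""
    return -0.5 * np.sum(c**2) / scale**2 - c.size * np.log(scale * np.sqrt(2.0 * np.pi))

def log_unnormalized_posterior(c):
    """log p(D | c) + log p(c), i.e. log p(c | D) up to the constant log p(D)."""
    return log_likelihood(c) + log_prior(c)

print(log_unnormalized_posterior(np.array([1.0, 2.0])))
```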
To work around this challenge, instead of sampling or evaluating \(p(\boldsymbol{c}\vert\mathcal{D})\), we will approximate it with a simpler distribution \(q_{\boldsymbol{\theta}}(\boldsymbol{c})\): $$ p(\boldsymbol{c}\vert\mathcal{D}) \approx q_{\boldsymbol{\theta}}(\boldsymbol{c}) $$ where \(\boldsymbol{\theta}\) are variational parameters to be optimized by minimizing the distance between the "true" and approximate distributions. The algorithm works as follows:
- Choose a tractable family of distributions \(\mathcal{Q} = \{q_\boldsymbol{\theta} : \boldsymbol{\theta} \in \Theta\}\)
- Use the Kullback-Leibler (KL) divergence as the measure of closeness, i.e.
$$ \text{KL}(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c}\vert\mathcal{D})) = \int_{\boldsymbol{c}} q_{\boldsymbol{\theta}}(\boldsymbol{c}) \log\frac{q_{\boldsymbol{\theta}}(\boldsymbol{c})}{p(\boldsymbol{c}\vert\mathcal{D})}\,d\boldsymbol{c} $$ to find \(\boldsymbol{\theta}^* \in \Theta\) such that \(q_{\boldsymbol{\theta}^*}(\boldsymbol{c})\) is closest to \(p(\boldsymbol{c}\vert\mathcal{D})\) among all possible choices for \(\boldsymbol{\theta}\).
The optimization algorithm uses the following Evidence Lower BOund (ELBO) expression for the loss function $$ \text{ELBO}(\boldsymbol{\theta}) = \mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{c})}\left[\log p(\mathcal{D}\vert\boldsymbol{c})\right] - \text{KL}\left(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c})\right) $$ Maximizing the ELBO is equivalent to minimizing the KL divergence above, since the two differ only by the constant model evidence \(\log p(\mathcal{D})\).
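As a rough illustration of how this loss can be estimated, the following NumPy sketch computes a Monte Carlo approximation of the negative ELBO for a Gaussian \(q_{\boldsymbol{\theta}}\) parametrized by its mean and a Cholesky factor; the function signature and the user-supplied log_likelihood and log_prior callables (e.g. as in the sketch above) are assumptions for illustration, not the library's API. Here both ELBO terms are estimated from samples; the analytic KL expressions used for Gaussian and uniform priors are given below.

```python
import numpy as np

def neg_elbo(mu, L, log_likelihood, log_prior, n_samples=64, rng=None):
    """Monte Carlo estimate of the negative ELBO for q = N(mu, L L^T).

    Both terms are estimated from the same reparametrized samples
    c_k = mu + L eps_k with eps_k ~ N(0, I).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = mu.size
    eps = rng.standard_normal((n_samples, n))
    c = mu + eps @ L.T                                   # samples from q
    # log q(c_k): since L^{-1}(c_k - mu) = eps_k, the Mahalanobis term is |eps_k|^2
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log det(L L^T)
    log_q = -0.5 * (np.sum(eps**2, axis=1) + log_det + n * np.log(2.0 * np.pi))
    expected_ll = np.mean([log_likelihood(ck) for ck in c])
    kl = np.mean(log_q - np.array([log_prior(ck) for ck in c]))  # MC estimate of KL(q||p)
    return -expected_ll + kl
```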
The current version of the library implements variational inference algorithms that assume a Gaussian family for the approximate posterior \(\mathcal{Q}\), as well as a Gaussian likelihood. For the prior distribution \(p(\boldsymbol{c})\), both uniform and Gaussian distributions are implemented.
For the case where both the approximate posterior and the prior are multivariate Gaussian distributions, \(q_{\boldsymbol{\theta}}(\boldsymbol{c})=\mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{\theta}},\Sigma_{\boldsymbol{\theta}})\) and \(p(\boldsymbol{c})=\mathcal{N}(\boldsymbol{\mu}_0,\Sigma_0)\), the KL divergence can be computed analytically as $$ \text{KL}\left(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c})\right) = \frac{1}{2}\left[ \text{tr}\left(\Sigma_0^{-1}\Sigma_{\boldsymbol{\theta}}\right) + (\boldsymbol{\mu}_0-\boldsymbol{\mu}_{\boldsymbol{\theta}})^T \Sigma_0^{-1} (\boldsymbol{\mu}_0-\boldsymbol{\mu}_{\boldsymbol{\theta}}) - n + \log\frac{\det\Sigma_0}{\det\Sigma_{\boldsymbol{\theta}}} \right] $$
If the prior is uniform over a region of volume \(V\) containing essentially all of the probability mass of \(q_{\boldsymbol{\theta}}\), the above expression reduces to $$ \text{KL}\left(q_{\boldsymbol{\theta}}(\boldsymbol{c}) \parallel p(\boldsymbol{c})\right) = \log V - \frac{1}{2}\log\det\Sigma_{\boldsymbol{\theta}} - \frac{n}{2}\left(1 + \log 2\pi\right) $$ where only the \(-\frac{1}{2}\log\det\Sigma_{\boldsymbol{\theta}}\) term depends on the variational parameters.
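The two expressions above can be translated almost verbatim into code; the following NumPy sketch is an illustrative implementation (the function names and the log_volume argument for the uniform-prior case are assumptions, not the library's API).

```python
import numpy as np

def kl_gaussian_gaussian(mu_q, cov_q, mu_0, cov_0):
    """Closed-form KL( N(mu_q, cov_q) || N(mu_0, cov_0) )."""
    n = mu_q.size
    cov0_inv = np.linalg.inv(cov_0)
    diff = mu_0 - mu_q
    return 0.5 * (np.trace(cov0_inv @ cov_q)
                  + diff @ cov0_inv @ diff
                  - n
                  + np.linalg.slogdet(cov_0)[1]
                  - np.linalg.slogdet(cov_q)[1])

def kl_gaussian_uniform(cov_q, log_volume):
    """KL( N(mu_q, cov_q) || uniform prior ), assuming q's mass lies inside the support.

    Only the -0.5 * log det(cov_q) part depends on the variational parameters.
    """
    n = cov_q.shape[0]
    entropy = 0.5 * (n * (1.0 + np.log(2.0 * np.pi)) + np.linalg.slogdet(cov_q)[1])
    return log_volume - entropy
```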
The first term on the right-hand side of the ELBO function, the expected log-likelihood, is estimated by Monte Carlo sampling, $$ \mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{c})}\left[\log p(\mathcal{D}\vert\boldsymbol{c})\right] \approx \frac{1}{K}\sum_{k=1}^{K}\log p(\mathcal{D}\vert\boldsymbol{c}_k) $$ with samples \(\boldsymbol{c}_k\) drawn from \(\mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{\theta}},\Sigma_{\boldsymbol{\theta}})\). Using the reparametrization trick, the samples are expressed as $$ \boldsymbol{c}_k = \boldsymbol{\mu}_{\boldsymbol{\theta}} + L\,\boldsymbol{\epsilon}_k $$
where \(\boldsymbol{\epsilon}_k\sim\mathcal{N}(\boldsymbol{0},\mathbb{I})\) and \(L\) is the lower-triangular matrix resulting from the Cholesky decomposition of \(\Sigma_{\boldsymbol{\theta}}\), i.e. \(\Sigma_{\boldsymbol{\theta}}=L L^T\). The expected log-likelihood expression above then becomes $$ \mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{c})}\left[\log p(\mathcal{D}\vert\boldsymbol{c})\right] \approx \frac{1}{K}\sum_{k=1}^{K}\log p(\mathcal{D}\vert\boldsymbol{\mu}_{\boldsymbol{\theta}} + L\,\boldsymbol{\epsilon}_k) $$
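A minimal NumPy sketch of this reparametrized Monte Carlo estimate, assuming a user-supplied log_likelihood callable for \(\log p(\mathcal{D}\vert\boldsymbol{c})\) (names and implementation are illustrative, not the library's API):

```python
import numpy as np

def expected_log_likelihood(mu, L, log_likelihood, n_samples=128, rng=None):
    """Reparametrized Monte Carlo estimate of E_q[ log p(D | c) ]."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n_samples, mu.size))   # eps_k ~ N(0, I)
    samples = mu + eps @ L.T                          # c_k = mu + L eps_k
    return np.mean([log_likelihood(c_k) for c_k in samples])
```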
Gradients with respect to the components of \(\boldsymbol{\theta}\), namely \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and \(\Sigma_{\boldsymbol{\theta}}\), are estimated through automatic differentiation tools, provided that the model depending on the parameters \(\boldsymbol{c}\) is differentiable.
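As an illustration of this last step, the following PyTorch sketch optimizes \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and a diagonal Cholesky factor of \(\Sigma_{\boldsymbol{\theta}}\) by gradient descent on the negative ELBO, with gradients obtained by automatic differentiation; the choice of PyTorch, the toy linear model, data, noise level, prior scale, and the diagonal-covariance restriction are all assumptions made for illustration and do not reflect the library's implementation.

```python
import torch

# Toy setup (assumptions for illustration): linear model y = c0 + c1*x with
# Gaussian noise, Gaussian prior N(0, scale^2 I), and a diagonal covariance
# for the approximate posterior q = N(mu, L L^T).
torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * torch.randn(x.shape)
sigma, scale = 0.1, 10.0

def log_likelihood(c):
    pred = c[0] + c[1] * x                        # differentiable model M(x; c)
    return -0.5 * torch.sum((y - pred) ** 2) / sigma**2

mu = torch.zeros(2, requires_grad=True)           # variational mean
log_diag = torch.zeros(2, requires_grad=True)     # log of the diagonal of L
opt = torch.optim.Adam([mu, log_diag], lr=0.05)

for step in range(500):
    opt.zero_grad()
    L = torch.diag(torch.exp(log_diag))           # Cholesky factor of Sigma_theta
    eps = torch.randn(64, 2)                      # eps_k ~ N(0, I)
    c = mu + eps @ L.T                            # reparametrized samples c_k
    expected_ll = torch.mean(torch.stack([log_likelihood(ck) for ck in c]))
    # Analytic KL(q || N(0, scale^2 I)) for diagonal Sigma_theta, up to constants
    kl = 0.5 * (torch.sum(torch.exp(2.0 * log_diag)) / scale**2
                + torch.sum(mu**2) / scale**2
                - 2.0 * torch.sum(log_diag))
    loss = -expected_ll + kl                      # negative ELBO
    loss.backward()                               # autodiff gradients w.r.t. mu, log_diag
    opt.step()

print(mu.detach(), torch.exp(log_diag).detach())
```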