Gaussian Networks

This tutorial describes how to perform efficient inference on a network of linear-Gaussian models using Bayesian networks, often referred to as Gaussian networks. Bayesian networks provide a compact representation of the joint distribution, which can be used to marginalize out variables and condition on observational data without constructing a full representation of the joint density over all variables.

Directed Acyclic Graphs

A Bayesian network (BN) structure is a directed acyclic graph (DAG) whose nodes represent random variables and whose edges represent probabilistic relationships between them. The graph can be represented by a tuple of vertices (or nodes) and edges G=(V,E). A node θj∈V is a parent of a random variable θi∈V if there is a directed edge (θj→θi)∈E. The set of parents of the random variable θi is denoted by θpa(i).

Let's import some necessary modules and then construct a DAG consisting of 3 groups of variables.

import numpy as np
# matplotlib and networkx are used below to plot and build the DAGs
import matplotlib.pyplot as plt
import networkx as nx
from scipy import stats
import pyapprox as pya
from pyapprox.gaussian_network import *
import copy
from pyapprox.configure_plots import *
np.random.seed(1)

nnodes=3
fig,ax=plt.subplots(1,1,figsize=(8,3))
_=plot_hierarchical_network(nnodes,ax)
fig.tight_layout()

For this network we have pa(θ1)=∅, pa(θ2)={θ1}, pa(θ3)={θ2}.

Conditional probability distributions

Bayesian networks use conditional probability distributions (CPDs) to encode the relationships between the variables of the graph. Let A,B,C denote three random variables. A is conditionally independent of B given C in distribution P, denoted by A⊥B∣C, if

P(A=a,B=b\mid C=c)=P(A=a\mid C=c)P(B=b\mid C=c).

The above graph encodes that θ1⊥θ3∣θ2. CPDs such as this can be used to form a compact representation of the joint density of all variables in the graph. Specifically, the joint distribution of the variables can be expressed as the product of the set of conditional probability distributions

P(\theta_1,\ldots,\theta_M)=\prod_{i=1}^{M}P(\theta_i\mid\theta_{\mathrm{pa}(i)}).

For our example we have

P(\theta_1,\theta_2,\theta_3)=P(\theta_1\mid\mathrm{pa}(\theta_1))P(\theta_2\mid\mathrm{pa}(\theta_2))P(\theta_3\mid\mathrm{pa}(\theta_3))=P(\theta_1)P(\theta_2\mid\theta_1)P(\theta_3\mid\theta_2)
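
This factorization is easy to verify numerically for a scalar version of the hierarchical network. The following sketch uses made-up parameters (a21, a32 and the noise variances q2, q3 are assumptions for illustration only) and checks that the product of the three CPDs equals the joint Gaussian density at a test point.

# scalar hierarchical example: theta_2 = a21*theta_1 + v_2, theta_3 = a32*theta_2 + v_3
s1, a21, q2, a32, q3 = 1.0, 0.5, 0.75, 0.4, 1.0

# joint covariance implied by the two linear-Gaussian CPDs (zero means for simplicity)
c11 = s1**2
c22 = a21**2*c11+q2
c33 = a32**2*c22+q3
cov = np.array([[c11, a21*c11, a32*a21*c11],
                [a21*c11, c22, a32*c22],
                [a32*a21*c11, a32*c22, c33]])

# P(theta_1)P(theta_2|theta_1)P(theta_3|theta_2) evaluated at a test point
t1, t2, t3 = 0.3, -0.2, 0.9
factorized_pdf = (stats.norm(0, s1).pdf(t1) *
                  stats.norm(a21*t1, np.sqrt(q2)).pdf(t2) *
                  stats.norm(a32*t2, np.sqrt(q3)).pdf(t3))
joint_pdf = stats.multivariate_normal(np.zeros(3), cov).pdf([t1, t2, t3])
assert np.allclose(factorized_pdf, joint_pdf)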

Any Bayesian network can be built from three types of structures: the hierarchical (or serial) structure we just plotted, the diverging structure, and the peer (or V-) structure.

For the peer network plotted using the code below we have pa(θ1)=∅, pa(θ2)=∅, pa(θ3)={θ1,θ2} and

P(\theta_1,\theta_2,\theta_3)=P(\theta_1\mid\mathrm{pa}(\theta_1))P(\theta_2\mid\mathrm{pa}(\theta_2))P(\theta_3\mid\mathrm{pa}(\theta_3))=P(\theta_1)P(\theta_2)P(\theta_3\mid\theta_1,\theta_2)

nnodes=3
fig,ax=plt.subplots(1,1,figsize=(8,3))
_=plot_peer_network(nnodes,ax)
fig.tight_layout()

For the diverging network plotted using the code below we have pa(θ1)={θ3}, pa(θ2)={θ3}, pa(θ3)=∅ and

P(\theta_1,\theta_2,\theta_3)=P(\theta_1\mid\mathrm{pa}(\theta_1))P(\theta_2\mid\mathrm{pa}(\theta_2))P(\theta_3\mid\mathrm{pa}(\theta_3))=P(\theta_3)P(\theta_1\mid\theta_3)P(\theta_2\mid\theta_3)

nnodes=3
fig,ax=plt.subplots(1,1,figsize=(8,3))
_=plot_diverging_network(nnodes,ax)
fig.tight_layout()

For Gaussian networks we assume that the parameters of each node θi=[θi,1,…,θi,Pi]^T are related by

\theta_i=\sum_{j\in\mathrm{pa}(i)}A_{ij}\theta_j+b_i+v_i,

where bi∈R^{Pi} is a deterministic shift and vi is Gaussian noise with mean zero and covariance Σvi∈R^{Pi×Pi}. Consequently, the CPDs take the form

P(\theta_i\mid\theta_{\mathrm{pa}(i)})\sim\mathcal{N}\left(\sum_{j\in\mathrm{pa}(i)}A_{ij}\theta_j+b_i,\Sigma_{v_i}\right)

which means that θi∼N(μi,Σii) where, for a node with a single parent j,

\mu_i=b_i+A_{ij}\mu_j,\qquad\Sigma_{ii}=\Sigma_{v_i}+A_{ij}\Sigma_{jj}A_{ij}^T
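
As a quick sanity check of these moment formulas, the following sketch uses made-up values of Aij, bi and the covariances (assumptions for illustration only) and compares the formulas against Monte Carlo estimates.

# made-up parameters for a node i with a single parent j (both of dimension 2)
A_ij = np.array([[0.5, 0.1], [0.0, 0.4]])
b_i = np.array([1.0, -1.0])
mu_j = np.array([-1.0, 2.0])
Sigma_jj = np.array([[1.0, 0.3], [0.3, 2.0]])
Sigma_vi = np.diag([0.2, 0.1])

# moments predicted by the formulas above
mu_i = b_i+A_ij.dot(mu_j)
Sigma_ii = Sigma_vi+A_ij.dot(Sigma_jj).dot(A_ij.T)

# Monte Carlo check using a local random generator (does not disturb np.random.seed above)
rng = np.random.default_rng(0)
nsamples = 200000
theta_j = rng.multivariate_normal(mu_j, Sigma_jj, nsamples)
v_i = rng.multivariate_normal(np.zeros(2), Sigma_vi, nsamples)
theta_i = theta_j.dot(A_ij.T)+b_i+v_i
assert np.allclose(theta_i.mean(axis=0), mu_i, atol=2e-2)
assert np.allclose(np.cov(theta_i.T), Sigma_ii, atol=2e-2)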

The Canonical Form

Unless a variable θi is a root node of the network, i.e. pa(θi)=∅, the CPD is in general not Gaussian; for root nodes the CPD is just the density of θi. However, we can represent the CPDs of non-root nodes and the Gaussian densities of root nodes with a single consistent representation. Specifically, we will use the canonical form of a set of variables X

\phi(x;K,h,g)=\exp\left(g+h^Tx-\tfrac{1}{2}x^TKx\right)

This canonical form can be used to represent each CPD in a graph such that

P(\theta_1,\ldots,\theta_M)=\prod_{i=1}^{M}\phi_i

For example, the hierarchical structure can be represented by three canonical factors ϕ1(θ1),ϕ2(θ1,θ2),ϕ3(θ2,θ3) such that

P(θ1,θ2,θ3)=ϕ1(θ1)ϕ2(θ1,θ2)ϕ3(θ2,θ3)

where we have dropped the dependence on K,h,g for convenience. These three factors respectively correspond to the three CPDs P(θ1), P(θ2∣θ1), P(θ3∣θ2).

To form the joint density (which we want to avoid in practice) and to perform inference and marginalization on a Gaussian graph, we must first understand the notion of scope and three basic operations on canonical factors: multiplication, marginalization and conditioning, also known as reduction.

Multiplying Canonical Forms

The scope of canonical forms

The scope of a canonical form ϕ(x) is the set of variables X and is denoted Scope[ϕ]. Consider the hierarchical structure just discussed; the scopes of the three factors are

Scope[ϕ1]={θ1},Scope[ϕ2]={θ1,θ2},Scope[ϕ3]={θ2,θ3}.

The multiplication of two canonical factors with the same scope X is simple:

ϕ1(X;K1,h1,g1)ϕ2(X;K2,h2,g2)=ϕ1(X;K1+K2,h1+h2,g1+g2).

To multiply two canonical forms with different scopes we must first extend the scopes to match and then apply the previous formula. This can be done by adding zeros to K and h of each canonical form. For example, consider the two canonical factors ϕ2(θ1,θ2) and ϕ3(θ2,θ3) of the hierarchical network with P1=P2=P3=1:

\phi_2\left(\theta_1,\theta_2;\begin{bmatrix}K_{2,11} & K_{2,12}\\ K_{2,21} & K_{2,22}\end{bmatrix},\begin{bmatrix}h_{2,1}\\ h_{2,2}\end{bmatrix},g_2\right),\qquad \phi_3\left(\theta_2,\theta_3;\begin{bmatrix}K_{3,11} & K_{3,12}\\ K_{3,21} & K_{3,22}\end{bmatrix},\begin{bmatrix}h_{3,1}\\ h_{3,2}\end{bmatrix},g_3\right)

Extending the scope and multiplying proceeds as follows

\phi_2\left(\theta_1,\theta_2,\theta_3;\begin{bmatrix}K_{2,11} & K_{2,12} & 0\\ K_{2,21} & K_{2,22} & 0\\ 0 & 0 & 0\end{bmatrix},\begin{bmatrix}h_{2,1}\\ h_{2,2}\\ 0\end{bmatrix},g_2\right)\phi_3\left(\theta_1,\theta_2,\theta_3;\begin{bmatrix}0 & 0 & 0\\ 0 & K_{3,11} & K_{3,12}\\ 0 & K_{3,21} & K_{3,22}\end{bmatrix},\begin{bmatrix}0\\ h_{3,1}\\ h_{3,2}\end{bmatrix},g_3\right)
=\phi_{2\times 3}\left(\theta_1,\theta_2,\theta_3;\begin{bmatrix}K_{2,11} & K_{2,12} & 0\\ K_{2,21} & K_{2,22}+K_{3,11} & K_{3,12}\\ 0 & K_{3,21} & K_{3,22}\end{bmatrix},\begin{bmatrix}h_{2,1}\\ h_{2,2}+h_{3,1}\\ h_{3,2}\end{bmatrix},g_2+g_3\right)
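
The following sketch performs this scope extension and multiplication with numpy for two made-up canonical factors over scalar variables (all K, h and g values are assumptions for illustration) and checks pointwise that the factor with the summed parameters equals the product of the two extended factors.

def canonical_form_eval(x, K, h, g):
    # evaluate phi(x; K, h, g) = exp(g + h^T x - 0.5 x^T K x)
    return np.exp(g+h.dot(x)-0.5*x.dot(K).dot(x))

# made-up canonical factors phi_2(theta_1,theta_2) and phi_3(theta_2,theta_3)
K2 = np.array([[2.0, -0.5], [-0.5, 1.5]]); h2 = np.array([0.3, -0.2]); g2 = 0.1
K3 = np.array([[1.0, 0.4], [0.4, 2.5]]); h3 = np.array([-0.1, 0.6]); g3 = -0.3

# extend both factors to the common scope (theta_1,theta_2,theta_3) by zero padding
K2_ext = np.zeros((3, 3)); K2_ext[:2, :2] = K2
h2_ext = np.zeros(3); h2_ext[:2] = h2
K3_ext = np.zeros((3, 3)); K3_ext[1:, 1:] = K3
h3_ext = np.zeros(3); h3_ext[1:] = h3

# multiplication simply adds the canonical parameters
K_prod, h_prod, g_prod = K2_ext+K3_ext, h2_ext+h3_ext, g2+g3

x = np.array([0.3, -1.2, 0.7])
assert np.allclose(
    canonical_form_eval(x, K_prod, h_prod, g_prod),
    canonical_form_eval(x, K2_ext, h2_ext, g2)*canonical_form_eval(x, K3_ext, h3_ext, g3))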

The canonical form of a Gaussian distribution

Now that we understand how to multiply two factors with different scopes, we can discuss how to convert CPDs into canonical factors. The canonical form of a normal distribution with mean m and covariance C has the parameters K=C^{-1}, h=Km and

g=-\tfrac{1}{2}m^Th-\tfrac{n}{2}\log(2\pi)+\tfrac{1}{2}\log|K|.
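
A short numerical check of this conversion is shown below (the mean and covariance are made up, assumed only for illustration): the canonical form with K=C^{-1}, h=Km and the g above reproduces the scipy multivariate normal density pointwise.

# made-up Gaussian with mean m and covariance C
m = np.array([0.5, -1.0])
C = np.array([[2.0, 0.6], [0.6, 1.0]])

K = np.linalg.inv(C)
h = K.dot(m)
n = m.shape[0]
g = -0.5*m.dot(h)-0.5*n*np.log(2*np.pi)+0.5*np.log(np.linalg.det(K))

# the canonical form reproduces the normalized Gaussian density
x = np.array([0.2, 0.4])
canonical_pdf = np.exp(g+h.dot(x)-0.5*x.dot(K).dot(x))
assert np.allclose(canonical_pdf, stats.multivariate_normal(m, C).pdf(x))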

The derivation of these expressions is simple and we leave it to the reader. The derivation of the canonical form of a linear-Gaussian CPD is more involved, however, so we derive it here.

The canonical form of a CPD

First we assume that the joint density of two Gaussian variables θi,θj can be represented as the product of two canonical forms, referred to as factors. Specifically, let ϕi(θi,θj) be the factor representing the CPD P(θi∣θj), let ϕj be the canonical form of the Gaussian density of θj, and assume

P(\theta_i,\theta_j)=P(\theta_i\mid\theta_j)P(\theta_j)=\phi_i\phi_j

Given the linear relationship of the CPD, θi=Aijθj+bi+vi, the inverse of the covariance of the Gaussian joint density P(θi,θj) is

\begin{bmatrix}\Sigma_{jj} & \Sigma_{jj}A_{ij}^T\\ A_{ij}\Sigma_{jj} & A_{ij}\Sigma_{jj}A_{ij}^T+\Sigma_{v_i}\end{bmatrix}^{-1}=\begin{bmatrix}\Sigma_{jj}^{-1}+A_{ij}^T\Sigma_{v_i}^{-1}A_{ij} & -A_{ij}^T\Sigma_{v_i}^{-1}\\ -\Sigma_{v_i}^{-1}A_{ij} & \Sigma_{v_i}^{-1}\end{bmatrix}

where the second equality is derived using the matrix inversion lemma. Because P(θj) is Gaussian, the factor ϕj has Kj=Σ_{jj}^{-1}∈R^{Pj×Pj}. Multiplying the two canonical factors, making sure to account for the different scopes, we have

\begin{bmatrix}\Sigma_{jj}^{-1} & 0\\ 0 & 0\end{bmatrix}+\begin{bmatrix}K_i^{11} & K_i^{12}\\ K_i^{21} & K_i^{22}\end{bmatrix}=\begin{bmatrix}\Sigma_{jj}^{-1}+A_{ij}^T\Sigma_{v_i}^{-1}A_{ij} & -A_{ij}^T\Sigma_{v_i}^{-1}\\ -\Sigma_{v_i}^{-1}A_{ij} & \Sigma_{v_i}^{-1}\end{bmatrix}

Equating terms in the above equation yields

(1)\qquad K_i^{11}=A_{ij}^T\Sigma_{v_i}^{-1}A_{ij},\quad K_i^{12}=\left(K_i^{21}\right)^T=-A_{ij}^T\Sigma_{v_i}^{-1},\quad K_i^{22}=\Sigma_{v_i}^{-1}.

A similar procedure can be used to find h_i=\left[(-A_{ij}^T\Sigma_{v_i}^{-1}b_i)^T,(\Sigma_{v_i}^{-1}b_i)^T\right]^T and g_i.
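
The block expressions in (1) and the corresponding h can be verified numerically. The sketch below uses made-up parameters (assumptions for illustration only): it builds the canonical parameters of a linear-Gaussian CPD and of its Gaussian parent, multiplies the two factors by adding their zero-padded parameters, and confirms the result matches the precision matrix and shift of the joint Gaussian of (θj,θi).

# made-up parameters of a parent theta_j and the CPD theta_i = A theta_j + b + v_i
mu_j = np.array([1.0, -1.0])
Sigma_jj = np.array([[1.5, 0.2], [0.2, 0.8]])
A_ij = np.array([[0.5, 0.0], [0.3, 0.4]])
b_i = np.array([0.2, -0.5])
Sigma_vi = np.diag([0.3, 0.6])
Svi_inv = np.linalg.inv(Sigma_vi)

# canonical parameters of the CPD factor over (theta_j, theta_i), following (1)
K_cpd = np.block([[A_ij.T.dot(Svi_inv).dot(A_ij), -A_ij.T.dot(Svi_inv)],
                  [-Svi_inv.dot(A_ij), Svi_inv]])
h_cpd = np.concatenate([-A_ij.T.dot(Svi_inv).dot(b_i), Svi_inv.dot(b_i)])

# canonical parameters of the parent (root) factor, zero padded to the joint scope
K_root = np.block([[np.linalg.inv(Sigma_jj), np.zeros((2, 2))],
                   [np.zeros((2, 2)), np.zeros((2, 2))]])
h_root = np.concatenate([np.linalg.solve(Sigma_jj, mu_j), np.zeros(2)])

# the product of the two factors must represent the joint Gaussian of (theta_j, theta_i)
joint_cov = np.block([[Sigma_jj, Sigma_jj.dot(A_ij.T)],
                      [A_ij.dot(Sigma_jj), A_ij.dot(Sigma_jj).dot(A_ij.T)+Sigma_vi]])
joint_mean = np.concatenate([mu_j, A_ij.dot(mu_j)+b_i])
assert np.allclose(K_cpd+K_root, np.linalg.inv(joint_cov))
assert np.allclose(h_cpd+h_root, np.linalg.inv(joint_cov).dot(joint_mean))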

Computing the joint density with the canonical forms

We are now in a position to compute the joint density of all the variables in a Gaussian network. Note that we would never want to do this in practice because it negates the benefit of the compact representation provided by the Gaussian network.

The following builds a hierarchical network with three nodes. Each node has Pi=2 variables. We set each matrix Aij=aijI to be diagonal with the same entry aij along the diagonal. This means that the k-th variable, k=1,…,Pi, of the i-th node is only related to the k-th variable of the j-th node.

nnodes = 3
graph = nx.DiGraph()
prior_covs = [1,2,3]
prior_means = [-1,-2,-3]
cpd_scales  =[0.5,0.4]
node_labels = [f'Node_{ii}' for ii in range(nnodes)]
nparams = np.array([2]*3)
cpd_mats  = [None,cpd_scales[0]*np.eye(nparams[1],nparams[0]),
             cpd_scales[1]*np.eye(nparams[2],nparams[1])]

Now we set up the directed acyclic graph providing the information needed to construct the CPDs. Specifically, we specify the Gaussian distributions of all the root nodes; in this example there is just one root node θ1. We then specify the parameters of the CPDs of the remaining two nodes. Here we choose the CPD covariance Σvi and shift bi such that the mean and variance of the parameters match those specified above. To do this we note that

E[\theta_2]=E[A_{21}\theta_1+b_2]=A_{21}E[\theta_1]+b_2,\qquad E[\theta_3]=A_{32}E[\theta_2]+b_3

and so set

b_2=E[\theta_2]-A_{21}E[\theta_1],\qquad b_3=E[\theta_3]-A_{32}E[\theta_2].

Similarly we define the CPD covariance so that the diagonal of the prior covariance matches the values specified

V[\theta_2]=\Sigma_{v_2}+A_{21}V[\theta_1]A_{21}^T,\qquad V[\theta_3]=\Sigma_{v_3}+A_{32}V[\theta_2]A_{32}^T

so

\Sigma_{v_2}=V[\theta_2]-A_{21}V[\theta_1]A_{21}^T,\qquad \Sigma_{v_3}=V[\theta_3]-A_{32}V[\theta_2]A_{32}^T

ii=0
graph.add_node(
    ii,label=node_labels[ii],cpd_cov=prior_covs[ii]*np.eye(nparams[ii]),
    nparams=nparams[ii],cpd_mat=cpd_mats[ii],
    cpd_mean=prior_means[ii]*np.ones((nparams[ii],1)))
for ii in range(1,nnodes):
    cpd_mean = np.ones((nparams[ii],1))*(
        prior_means[ii]-cpd_scales[ii-1]*prior_means[ii-1])
    cpd_cov = np.eye(nparams[ii])*max(
        1e-8,prior_covs[ii]-cpd_scales[ii-1]**2*prior_covs[ii-1])
    graph.add_node(ii,label=node_labels[ii],cpd_cov=cpd_cov,
                   nparams=nparams[ii],cpd_mat=cpd_mats[ii],
                   cpd_mean=cpd_mean)

graph.add_edges_from([(ii,ii+1) for ii in range(nnodes-1)])


network = GaussianNetwork(graph)
network.convert_to_compact_factors()
labels = [l[1] for l in network.graph.nodes.data('label')]
factor_prior = network.factors[0]
for ii in range(1,len(network.factors)):
    factor_prior *= network.factors[ii]
prior_mean,prior_cov = convert_gaussian_from_canonical_form(
    factor_prior.precision_matrix,factor_prior.shift)
print('Prior Mean\n',prior_mean)
print('Prior Covariance\n',prior_cov)

Out:

Prior Mean
 [-1. -1. -2. -2. -3. -3.]
Prior Covariance
 [[1.  0.  0.5 0.  0.2 0. ]
 [0.  1.  0.  0.5 0.  0.2]
 [0.5 0.  2.  0.  0.8 0. ]
 [0.  0.5 0.  2.  0.  0.8]
 [0.2 0.  0.8 0.  3.  0. ]
 [0.  0.2 0.  0.8 0.  3. ]]

We can check that the prior mean and the diagonal of the prior covariance match the values we specified

true_prior_mean = np.hstack(
    [[prior_means[ii]]*nparams[ii] for ii in range(nnodes)])
assert np.allclose(true_prior_mean,prior_mean)
true_prior_var = np.hstack(
    [[prior_covs[ii]]*nparams[ii] for ii in range(nnodes)])
assert np.allclose(true_prior_var,np.diag(prior_cov))

If the reader is interested they can also compare the entire prior covariance with

\begin{bmatrix}\Sigma_{11} & \Sigma_{11}A_{21}^T & \Sigma_{11}A_{21}^TA_{32}^T\\ A_{21}\Sigma_{11} & \Sigma_{22} & \Sigma_{22}A_{32}^T\\ A_{32}A_{21}\Sigma_{11} & A_{32}\Sigma_{22} & \Sigma_{33}\end{bmatrix}
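
For instance, the following sketch (reusing prior_cov, prior_covs, cpd_mats and nparams defined above) assembles this block matrix with numpy and compares it to the prior covariance computed from the canonical factors.

S11 = prior_covs[0]*np.eye(nparams[0])
S22 = prior_covs[1]*np.eye(nparams[1])
S33 = prior_covs[2]*np.eye(nparams[2])
A21, A32 = cpd_mats[1], cpd_mats[2]
block_prior_cov = np.block(
    [[S11, S11.dot(A21.T), S11.dot(A21.T).dot(A32.T)],
     [A21.dot(S11), S22, S22.dot(A32.T)],
     [A32.dot(A21).dot(S11), A32.dot(S22), S33]])
assert np.allclose(block_prior_cov, prior_cov)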

Conditioning The Canonical Form

Using the canonical form of these factors we can easily condition them on available data. Given a canonical form over two variables X,Y with

(2)\qquad K=\begin{bmatrix}K_{XX} & K_{XY}\\ K_{YX} & K_{YY}\end{bmatrix},\qquad h=\begin{bmatrix}h_X\\ h_Y\end{bmatrix}

and given data y (also called evidence in the literature), the parameterization of the canonical form of the factor conditioned on the data is simply

(3)\qquad K'=K_{XX},\qquad h'=h_X-K_{XY}y,\qquad g'=g+h_Y^Ty-\tfrac{1}{2}y^TK_{YY}y
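
As a small sketch of how this works (with a made-up joint Gaussian over two variables X and one variable Y, assumed only for illustration), the code below converts the Gaussian to canonical form, conditions on Y=y using (3), and checks the result against the textbook formulas for a Gaussian conditional distribution.

# made-up joint Gaussian over X (first two variables) and Y (last variable)
m = np.array([0.0, 1.0, -1.0])
C = np.array([[2.0, 0.5, 0.4],
              [0.5, 1.0, 0.3],
              [0.4, 0.3, 1.5]])
K = np.linalg.inv(C)
h = K.dot(m)

# condition the canonical form on the evidence Y=y using (3)
y = np.array([0.7])
K_cond = K[:2, :2]
h_cond = h[:2]-K[:2, 2:].dot(y)

# compare with the usual formulas for the Gaussian conditional X | Y=y
C_cond = C[:2, :2]-C[:2, 2:].dot(np.linalg.solve(C[2:, 2:], C[2:, :2]))
m_cond = m[:2]+C[:2, 2:].dot(np.linalg.solve(C[2:, 2:], y-m[2:]))
assert np.allclose(np.linalg.inv(K_cond), C_cond)
assert np.allclose(np.linalg.solve(K_cond, h_cond), m_cond)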

Classical inference for linear Gaussian models as a two node network

Gaussian networks enable efficient inference of the unknown variables using observational data. Classical inference for linear-Gaussian models can be represented by a graph consisting of two nodes. First set up a dataless graph with a single node

nnodes=1
graph = nx.DiGraph()
ii=0
graph.add_node(
    ii,label=node_labels[ii],cpd_cov=prior_covs[ii]*np.eye(nparams[ii]),
    nparams=nparams[ii],cpd_mat=cpd_mats[ii],
    cpd_mean=prior_means[ii]*np.ones((nparams[ii],1)))

nsamples = [10]
noise_std=[0.01]*nnodes
data_cpd_mats=[np.random.normal(0,1,(nsamples[ii],nparams[ii]))
               for ii in range(nnodes)]
data_cpd_vecs=[np.ones((nsamples[ii],1)) for ii in range(nnodes)]
true_coefs=[np.random.normal(0,np.sqrt(prior_covs[ii]),(nparams[ii],1))
            for ii in range(nnodes)]
noise_covs=[np.eye(nsamples[ii])*noise_std[ii]**2
            for ii in range(nnodes)]

network = GaussianNetwork(graph)

Now let's add data and plot the new graph. Specifically we will add data

Y=Φθ1+b+ϵ

which has the form of a CPD. Here ϵ is mean-zero Gaussian noise with covariance Σϵ=σ²I.

network.add_data_to_network(data_cpd_mats,data_cpd_vecs,noise_covs)

fig,ax=plt.subplots(1,1,figsize=(8,6))
plot_hierarchical_network_network_with_data(network.graph,ax)

As you can see we have added a new CPD P(Y1∣θ1), represented by the canonical form ϕ2(Y1,θ1;K,h,g). Using (1) and then conditioning on the observed data y with (3), the resulting factor over θ1 has

K=\Phi^T\Sigma_{\epsilon}^{-1}\Phi,\qquad h=\Phi^T\Sigma_{\epsilon}^{-1}(y-b).

We then combine this conditioned CPD factor with its parent factor (associated with the prior distribution of the parameters θi) by multiplying the two factors together, after the data variables have been eliminated from the scope of the CPD. The combined factor has parameters

K=\Phi_i^T\Sigma_{\epsilon_i}^{-1}\Phi_i+\Sigma_{ii}^{-1},\qquad h=\Sigma_{ii}^{-1}\mu_i+\Phi_i^T\Sigma_{\epsilon_i}^{-1}(y-b_i)

which represents a Gaussian with mean and covariance given by

\Sigma^{\mathrm{post}}=\left(\Phi_i^T\Sigma_{\epsilon_i}^{-1}\Phi_i+\Sigma_{ii}^{-1}\right)^{-1},\qquad \mu^{\mathrm{post}}=\Sigma^{\mathrm{post}}\left(\Sigma_{ii}^{-1}\mu_i+\Phi_i^T\Sigma_{\epsilon_i}^{-1}(y-b_i)\right)

which is just the usual expression for the posterior of a Gaussian linear model, using only the linear model, noise and prior associated with a single node. Here we used the relationship between a canonical factor and the covariance C and mean m of the equivalent Gaussian distribution

C=K^{-1},\qquad m=Ch
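
To make this equivalence concrete, here is a small sketch (with a made-up design matrix Φ and observations y, and no shift term b, all assumed only for illustration) that computes the posterior from the canonical-form expressions above and checks it against the posterior obtained by directly conditioning the joint Gaussian of (θ,Y).

rng = np.random.default_rng(1)
nobs, npar = 5, 2
Phi = rng.normal(0, 1, (nobs, npar))
y_obs = rng.normal(0, 1, (nobs,))
mu_prior = np.array([-1.0, 1.0])
C_prior = np.diag([2.0, 0.5])
C_noise = 0.01*np.eye(nobs)

# posterior from the combined canonical factor
K_post = Phi.T.dot(np.linalg.inv(C_noise)).dot(Phi)+np.linalg.inv(C_prior)
h_post = np.linalg.inv(C_prior).dot(mu_prior)+Phi.T.dot(np.linalg.inv(C_noise)).dot(y_obs)
post_cov = np.linalg.inv(K_post)
post_mean = post_cov.dot(h_post)

# posterior from conditioning the joint Gaussian of (theta, Y) on Y=y
CYY = Phi.dot(C_prior).dot(Phi.T)+C_noise
gain = C_prior.dot(Phi.T).dot(np.linalg.inv(CYY))
cond_mean = mu_prior+gain.dot(y_obs-Phi.dot(mu_prior))
cond_cov = C_prior-gain.dot(Phi).dot(C_prior)
assert np.allclose(post_mean, cond_mean)
assert np.allclose(post_cov, cond_cov)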

Let’s compute the posterior with this procedure

network.convert_to_compact_factors()

noise = [noise_std[ii]*np.random.normal(
    0,noise_std[ii],(nsamples[ii],1))]
values_train=[b.dot(c)+s+n for b,c,s,n in zip(
    data_cpd_mats,true_coefs,data_cpd_vecs,noise)]

evidence, evidence_ids = network.assemble_evidence(values_train)


for factor in network.factors:
    factor.condition(evidence_ids,evidence)
factor_post = network.factors[0]
for jj in range(1, len(network.factors)):
    factor_post *= network.factors[jj]
gauss_post = convert_gaussian_from_canonical_form(
    factor_post.precision_matrix,factor_post.shift)

We can check this matches the posterior returned by the classical formulas

from pyapprox.bayesian_inference.laplace import \
    laplace_posterior_approximation_for_linear_models
true_post = laplace_posterior_approximation_for_linear_models(
    data_cpd_mats[0],prior_means[ii]*np.ones((nparams[0],1)),
    np.linalg.inv(prior_covs[ii]*np.eye(nparams[0])),
    np.linalg.inv(noise_covs[0]),values_train[0],data_cpd_vecs[0])

assert np.allclose(gauss_post[1],true_post[1])
assert np.allclose(gauss_post[0],true_post[0].squeeze())

Marginalizing Canonical Forms

Gaussian networks are best used when one wants to compute a marginal of the joint density of the parameters, possibly conditioned on data. The following describes the process of marginalization and conditioning, often referred to as the sum-product eliminate variable algorithm.

First let's discuss how to marginalize a canonical form, e.g. compute

\int\phi(X,Y;K,h,g)\,dY

which marginalizes out the variable Y from a canonical form also involving the variable X. Provided KYY in (2) is positive definite the marginalized canonical form has parameters

K'=K_{XX}-K_{XY}K_{YY}^{-1}K_{YX},\qquad h'=h_X-K_{XY}K_{YY}^{-1}h_Y,\qquad g'=g+\tfrac{1}{2}\left(P_Y\log(2\pi)-\log|K_{YY}|+h_Y^TK_{YY}^{-1}h_Y\right)

where P_Y denotes the number of variables in Y.
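
The expressions for K′ and h′ can be checked with a small sketch (the mean and covariance below are made up, assumed only for illustration): marginalizing the canonical form of a Gaussian over Y must recover the canonical form of the Gaussian marginal of X, whose mean and covariance are just the corresponding blocks of m and C.

# made-up Gaussian over X (first two variables) and Y (last variable) in canonical form
m = np.array([1.0, -0.5, 2.0])
C = np.array([[1.5, 0.4, 0.2],
              [0.4, 1.0, 0.3],
              [0.2, 0.3, 2.0]])
K = np.linalg.inv(C)
h = K.dot(m)

# marginalize out Y using the formulas above
KYY_inv = np.linalg.inv(K[2:, 2:])
K_marg = K[:2, :2]-K[:2, 2:].dot(KYY_inv).dot(K[2:, :2])
h_marg = h[:2]-K[:2, 2:].dot(KYY_inv).dot(h[2:])

# the Gaussian marginal of X simply keeps the corresponding blocks of m and C
assert np.allclose(np.linalg.inv(K_marg), C[:2, :2])
assert np.allclose(np.linalg.solve(K_marg, h_marg), m[:2])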

Computing the marginal density of a network conditioned on data

Again consider a recursive (hierarchical) network of three nodes

nnodes = 3
graph = nx.DiGraph()
prior_covs = [1,2,3]
prior_means = [-1,-2,-3]
cpd_scales  =[0.5,0.4]
node_labels = [f'Node_{ii}' for ii in range(nnodes)]
nparams = np.array([2]*3)
cpd_mats  = [None,cpd_scales[0]*np.eye(nparams[1],nparams[0]),
             cpd_scales[1]*np.eye(nparams[2],nparams[1])]

ii=0
graph.add_node(
    ii,label=node_labels[ii],cpd_cov=prior_covs[ii]*np.eye(nparams[ii]),
    nparams=nparams[ii],cpd_mat=cpd_mats[ii],
    cpd_mean=prior_means[ii]*np.ones((nparams[ii],1)))
for ii in range(1,nnodes):
    cpd_mean = np.ones((nparams[ii],1))*(
        prior_means[ii]-cpd_scales[ii-1]*prior_means[ii-1])
    cpd_cov = np.eye(nparams[ii])*max(
        1e-8,prior_covs[ii]-cpd_scales[ii-1]**2*prior_covs[ii-1])
    graph.add_node(ii,label=node_labels[ii],cpd_cov=cpd_cov,
                   nparams=nparams[ii],cpd_mat=cpd_mats[ii],
                   cpd_mean=cpd_mean)

graph.add_edges_from([(ii,ii+1) for ii in range(nnodes-1)])

nsamples = [3]*nnodes
noise_std=[0.01]*nnodes
data_cpd_mats=[np.random.normal(0,1,(nsamples[ii],nparams[ii]))
               for ii in range(nnodes)]
data_cpd_vecs=[np.ones((nsamples[ii],1)) for ii in range(nnodes)]
true_coefs=[np.random.normal(0,np.sqrt(prior_covs[ii]),(nparams[ii],1))
            for ii in range(nnodes)]
noise_covs=[np.eye(nsamples[ii])*noise_std[ii]**2
            for ii in range(nnodes)]

network = GaussianNetwork(graph)
network.add_data_to_network(data_cpd_mats,data_cpd_vecs,noise_covs)

noise = [noise_std[ii]*np.random.normal(
    0,noise_std[ii],(nsamples[ii],1)) for ii in range(nnodes)]
values_train=[b.dot(c)+s+n for b,c,s,n in zip(
    data_cpd_mats,true_coefs,data_cpd_vecs,noise)]

evidence, evidence_ids = network.assemble_evidence(values_train)

fig,ax=plt.subplots(1,1,figsize=(8,6))
plot_hierarchical_network_network_with_data(network.graph,ax)
#plt.show()

The sum-product eliminate variable algorithm begins by conditioning all the network factors on the available data and then finding the ids of the variables to eliminate from the factors. Let the scope of the entire network be Ψ, e.g. for this example Ψ={θ1,θ2,θ3,Y1,Y2,Y3}. If we wish to compute the θ3 marginal, i.e. marginalize out all other variables, then the variables that need to be eliminated are those associated with the nodes Ψ∖({θ3}∪{Y1,Y2,Y3})={θ1,θ2}. Variables associated with evidence (data) should not be marked for elimination (marginalization).

network.convert_to_compact_factors()
for factor in network.factors:
     factor.condition(evidence_ids,evidence)

query_labels = [node_labels[2]]
eliminate_ids = get_var_ids_to_eliminate_from_node_query(
    network.node_var_ids,network.node_labels,query_labels,evidence_ids)

Once the variables to eliminate have been identified, they are marginalized out of any factor in which they are present; other factors are left untouched. The marginalized factors are then multiplied with the remaining factors to compute the desired marginal density using the sum-product variable elimination algorithm.

factor_post = sum_product_variable_elimination(network.factors,eliminate_ids)

gauss_post = convert_gaussian_from_canonical_form(
    factor_post.precision_matrix,factor_post.shift)

print('Posterior Mean\n',gauss_post[0])
print('Posterior Covariance\n',gauss_post[1])

Out:

Posterior Mean
 [-0.36647195  1.0164208 ]
Posterior Covariance
 [[ 6.33358021e-03 -5.39518433e-04]
 [-5.39518433e-04  6.51970946e-05]]