Deep Gaussian Processes

PyApprox Tutorial Library

Composing GPs into hierarchical models that learn input-dependent length scales, capture sharp features, and produce non-Gaussian predictive distributions while preserving principled uncertainty.

Download Notebook

Download as Jupyter Notebook

Learning Objectives

After completing this tutorial, you will be able to:

Explain what a Deep Gaussian Process (DGP) is and how composing GP layers produces non-stationary behavior from stationary ingredients
Identify situations where a single-layer GP is fundamentally limited and a DGP is appropriate
Recognize that the DGP predictive distribution is generally not Gaussian, even with a Gaussian likelihood
Describe the role of inducing points at every layer in making DGP inference tractable
Anticipate the inference and identifiability challenges that motivate the doubly-stochastic variational framework treated in later tutorials

Prerequisites

Complete Sparse Variational Gaussian Processes before this tutorial. The DGP construction reuses the SVGP machinery — inducing points, the variational posterior $q(\mathbf{u})$, the ELBO — at every layer of the deep model.

Why Compose GPs?

The Gaussian process from earlier tutorials uses a stationary kernel: the covariance between two function values $f(\mathbf{x})$ and $f(\mathbf{x}')$ depends only on the displacement $\mathbf{x} - \mathbf{x}'$, not on the absolute locations. This gives a single GP its characteristic behavior: one length scale that applies everywhere, and uncertainty bands that close down near data and open up away from it.

Stationarity is reasonable for many engineering responses, but it fails on a surprising number of problems. Two diagnostic situations:

Discontinuities or sharp transitions. A stationary GP cannot resolve a step without ringing — the same length scale that needs to be short at the discontinuity makes the GP overfit smooth regions on either side.
Non-uniform smoothness. A response that is rough in one region and smooth in another (shock fronts in CFD, phase transitions, regime-switching dynamics) cannot be well-represented by any single length scale.

The figure below illustrates the first failure mode and contrasts it with what a two-layer DGP produces on the same data.

Figure 1: Single-layer SVGP (left) versus a two-layer DGP (right) fitted on the same 1D step-function data with the same kernel family and the same Adam → L-BFGS-B fitter — only the depth differs. The single GP smooths the transition with a single length scale and ends up overconfident: its uncertainty band hugs the smooth interpolant tightly, even though the truth (dashed) is a sharp step. The DGP makes a noticeably sharper transition and, importantly, *widens* its uncertainty band in the regions where it has less information — honest behavior the single GP cannot produce. The annotation boxes give test-set RMSE and mean log-likelihood on 200 held-out points; the DGP improves RMSE by 2× and improves log-likelihood by an order of magnitude, the latter dominated by the single GP’s mis-calibrated bands.

The fix is not to invent a more elaborate stationary kernel. It is to make the kernel itself input-dependent by composing GPs.

How Good Is the Approximation?

The figures so far make a visual case for the DGP, but visual impressions are easy to mislead. Two quantitative checks anchor what we see.

First, point-prediction accuracy: the small annotation boxes on each panel of the comparison figure above show test-set RMSE and mean log-likelihood on a held-out set of 200 inputs sampled from the same step function. The DGP’s RMSE is roughly half the single GP’s. The mean log-likelihood gap is much larger, and the difference is informative: log-likelihood rewards a model that gets both the mean and the uncertainty right. The single GP’s log-likelihood is strongly negative because its bands are too narrow at the discontinuity, so the truth falls outside the predictive distribution and contributes a large penalty. The DGP’s log-likelihood is positive because its wider, more honest bands assign reasonable density to the true values.

Second, calibration of predictive intervals: even a model with low RMSE can produce dishonest uncertainty bands. To check this we ask, at each nominal credible-interval level (say 90%), what fraction of held-out points actually fall inside the predicted band? A perfectly calibrated model gives empirical coverage equal to the nominal level. A model whose curve sits below the diagonal is overconfident — its bands are too narrow. A model above the diagonal is underconfident — bands too wide.

Figure 2: Reliability diagram for the single-layer SVGP and the two-layer DGP on the step-function held-out test set. The horizontal axis is the nominal credible-interval level; the vertical axis is the empirical fraction of test points falling inside the corresponding band. Points on the diagonal indicate perfect calibration. The single GP sits well below the diagonal across all levels: it is severely overconfident, its bands too narrow to cover the true step at the discontinuity. The DGP sits above the diagonal at low-to-mid nominal levels and approaches it at the highest level. The DGP is underconfident — its bands are wider than they strictly need to be — but never *over*confident, which is the more dangerous failure mode in scientific applications.

Neither model sits exactly on the diagonal, but their failure modes are different. The single GP is overconfident: it claims sharp uncertainty bands that don’t actually contain the truth at the discontinuity. The DGP is underconfident: it claims wider bands than it strictly needs, but the bands do contain the truth. The two have very different consequences for downstream use. An overconfident model gives you false certainty about wrong predictions — a quietly dangerous failure. An underconfident model gives you honest predictions with conservative uncertainty bounds — less efficient but safer.

This trade is a real one. The DGP’s wider bands are not a free lunch: in regions where the data is dense and the function is locally simple, the DGP could in principle produce sharper bands. The chained Adam → L-BFGS-B optimizer with $M=8$ inducing points per layer on $N=30$ training points has not pinned the variational posterior down tightly enough to be sharper. More inducing points, more iterations, or more training data would all narrow the gap. For a tutorial-scale fit on a difficult step-function problem, the DGP’s behavior is exactly what calibrated probabilistic regression should look like: don’t overstate certainty.

The Construction

A two-layer Deep Gaussian Process stacks two GPs:

\[ f_1 \sim \mathcal{GP}\!\bigl(0,\, k_1(\mathbf{x}, \mathbf{x}')\bigr), \qquad f_2 \sim \mathcal{GP}\!\bigl(0,\, k_2(\mathbf{h}, \mathbf{h}')\bigr). \]

The output of the first GP becomes the input of the second:

\[ y = f_2\!\bigl(f_1(\mathbf{x})\bigr) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\sigma^2). \]

Each kernel is stationary in its own input space, but the composition is not stationary in $\mathbf{x}$. The first layer learns a warping of the input; the second layer is a smooth function on the warped space. Together they produce input-dependent effective length scales without ever calling on a non-stationary kernel.

Figure 3 decomposes this composition for a concrete example with $f_1(x) = 1.6\tanh(2.5x)$ (a smooth sigmoid) and $f_2(h) = 0.8h + 0.4\sin(2h)$ (a smooth, mildly wavy near-linear function). A smooth $f_1$ (left) maps the input to a warped hidden space. A smooth $f_2$ (middle) is defined on that hidden space. The composition $f_2 \circ f_1$ (right) is sharp where $f_1$ is steep and gentle where $f_1$ is flat — even though both ingredients individually are smooth. This is the mechanism behind the DGP’s behavior in Figure 1: layer 0 finds a steep warping near the discontinuity, and the composition with the leaf produces the sharp transition without any non-stationary kernel.

Figure 3: How two stationary GPs compose into a non-stationary effect. Layer 1 (blue) is a smooth warping of the input. Layer 2 (orange) is a smooth function of the warped value. The composition (green) is steep where $f_1$ is steep, producing the sharp transition needed for step-like data without any non-stationary kernel.

The same construction extends to $L$ layers:

\[ f_L \circ f_{L-1} \circ \cdots \circ f_1, \qquad f_\ell \sim \mathcal{GP}\!\bigl(0,\, k_\ell\bigr) \text{ for } \ell = 1, \dots, L. \]

In practice $L=2$ or $L=3$ already captures most of the additional flexibility, and going deeper introduces inference difficulties that outweigh the gains for most surrogate-modeling applications.

Inference: Inducing Points at Every Layer

The single-layer SVGP from the previous tutorial introduced inducing points $\mathbf{Z}$ as a compressed summary of the GP at $M \ll N$ chosen locations, with a variational distribution $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$ over the function values at those locations. The ELBO bounded the log marginal likelihood from below and was tractable to optimize.

A DGP with $L$ layers needs all of this, $L$ times over. Each layer $\ell$ has its own:

inducing locations $\mathbf{Z}_\ell$ in the input space appropriate for that layer,
variational distribution $q(\mathbf{u}_\ell) = \mathcal{N}(\mathbf{m}_\ell, S_\ell)$,
kernel hyperparameters $\boldsymbol\theta_\ell$.

The deep ELBO has the same shape as the SVGP ELBO: a data-fit term minus a sum of KL divergences, one per layer:

\[ \text{ELBO} \;=\; \sum_{i=1}^{N} \mathbb{E}_{q(f_L(\mathbf{x}_i))}\!\bigl[\log p(y_i \mid f_L(\mathbf{x}_i))\bigr] \;-\; \sum_{\ell=1}^{L} \mathrm{KL}\!\bigl[q(\mathbf{u}_\ell) \,\big\|\, p(\mathbf{u}_\ell)\bigr]. \]

The KL terms have closed-form Gaussian-Gaussian expressions. The data-fit term is the one that gets harder: $f_L(\mathbf{x}_i)$ is the leaf-layer value, but the leaf layer’s input is the previous layer’s output, which is itself a random function of $\mathbf{x}_i$. The expectation marginalizes over the joint posterior of all hidden layers, which is no longer Gaussian.

This is where the doubly-stochastic machinery comes in: one source of stochasticity is the variational distribution $q(\mathbf{u}_\ell)$ at each layer, the second is the propagation of samples between layers. Practical DGP fitting marginalizes $\mathbf{u}_\ell$ analytically inside each layer, then propagates Monte Carlo (or Gauss–Hermite, or quasi-Monte-Carlo) samples through the chain to estimate the data-fit term.

For now the practical takeaway is this: fitting a DGP is structurally like fitting $L$ SVGPs simultaneously, with samples from each layer feeding the next. The cost per gradient step is roughly $L$ times the cost of a single SVGP step. Convergence is harder than for a single SVGP because the joint loss is non-convex.

What the Predictive Distribution Looks Like

The single-layer SVGP gives a Gaussian predictive distribution at every test point. The DGP does not. Even with a Gaussian likelihood and Gaussian variational distributions at every layer, the predictive distribution at a new $\mathbf{x}^*$ is:

\[ q(f_L(\mathbf{x}^*)) \;=\; \int\! q\bigl(f_L \mid f_{L-1}\bigr)\, q\bigl(f_{L-1} \mid f_{L-2}\bigr) \,\cdots\, q\bigl(f_1 \mid \mathbf{x}^*\bigr) \, df_{L-1} \cdots df_1. \]

Each conditional $q(f_\ell \mid f_{\ell-1})$ is Gaussian, but the marginal of $f_L(\mathbf{x}^*)$ — averaging over all the intermediate randomness — is a mixture and need not be Gaussian. In regions where the layer-1 warping is steep, the marginal can be markedly skewed or heavy-tailed.

The figure below makes this concrete. The same fitted DGP is queried at three test locations chosen to land in qualitatively different regions: one in a smooth region away from the warping, one inside the transition region near $x = 0$, and one in extrapolation territory. At each location the histogram of predictive samples is overlaid with the Gaussian whose mean and standard deviation match the samples. In the smooth region and inside the transition the histograms are well-approximated by a Gaussian — a 2-layer DGP fitted to this data has not produced much non-Gaussianity at training inputs. In extrapolation, however, the histogram is strongly skewed: a tall sharp peak near $f(x) = 1$ with a long tail reaching down toward $-1$, completely unlike the matched Gaussian. The non-Gaussian regime is where the DGP’s predictive flexibility shows up most clearly.

Figure 4: Predictive distributions at three test points from the same fitted two-layer DGP. In the smooth region (left, blue) and inside the transition (center, orange) the predictive samples are well-approximated by a Gaussian (dashed black) — the moment-matched Gaussian and the histogram are in close agreement. In extrapolation (right, purple) the histogram is strongly skewed and bimodal-looking: a sharp peak near $+1$ with a long tail reaching down toward $-1$. The Gaussian with the same mean and standard deviation is visibly inadequate. This is the regime where the DGP’s predictive distribution genuinely needs the full sample histogram, not just its first two moments.

The extrapolation regime is where the DGP’s non-Gaussianity matters most in practice. When the model is asked to predict outside the training data, it is also asked to propagate its uncertainty about the warping forward through the chain. That propagation mixes a wide hidden-layer posterior through a non-trivial leaf, producing the kind of skewed predictive seen on the right. Collapsing this to mean and standard deviation throws away the structure that makes the prediction informative — you would conclude “the model is uncertain about everything between $-1$ and $+1$” when in fact the model has put most of its predictive mass on a specific narrow band near $+1$, with a long lower tail accounting for the chance that the warping had a different shape.

In code, this means you should not collapse a DGP’s predictive distribution to its mean and standard deviation when reporting uncertainty in extrapolation regions. Use the full predictive samples for downstream tasks like uncertainty propagation, calibration analysis, or risk-aware decision making.

Inspecting What the DGP Learned

A fitted DGP is partially interpretable. The first layer’s posterior mean $\mu_{f_1}(\mathbf{x})$ is itself a function of the original input — it is the warping that the data found useful. The leaf layer’s posterior mean $\mu_{f_2}(\mathbf{h})$ is a smooth function of the hidden value. Plotting them separately reveals the structure the DGP discovered.

Note that the left panel of Figure 5 does not show the same quantity as Figure 1. Figure 1’s vertical axis is $f(x)$, the model’s prediction of the observed $y$. The left panel below has $f_1(x)$ on its vertical axis — the hidden value the warping assigns to each input. The black dots are the same training inputs $x_i$ from Figure 1, but they are plotted at their fitted hidden values $h_i = \mu_{f_1}(x_i)$, not at their observed $y_i$.

A helpful way to read the right panel of the figure below: the deep model has effectively rewritten the training set. The original observations are pairs $(x_i, y_i)$; the leaf layer is performing GP regression not on those, but on the rewritten pairs $(h_i, y_i)$, where $h_i = \mu_{f_1}(x_i)$ is the hidden value the warping assigned to training input $x_i$. The black dots in the right panel show this rewritten dataset, and the green dots show the leaf’s per-point predictions $\hat y_i = \mu_{f_2}(x_i, h_i)$ at each training input. The thin grey segments connecting black-to-green dots are the residuals, and the annotated RMS gives a single number for fit quality. A good fit shows short segments and a small residual RMS; long segments would indicate the leaf is failing to track the rewritten data — a sign that either the noise term has been inflated to absorb structure or the warping is too crude.

The dashed grey curve is the leaf’s $\mu_{f_2}(x_{\mathrm{med}}, h)$ slice at the median training input. The slice is useful for visualizing what the leaf looks like as a function of $h$ alone, but it is the per-point overlay (green dots + residual segments) that actually measures fit quality, because the leaf’s true predictions depend on both $\mathbf{x}$ and $h$ through the skip connection.

Figure 5: Decomposing what the fitted two-layer DGP learned. Left: the hidden layer’s posterior mean as a function of the input — the warping the model discovered. Black dots show each training input $x_i$ at its fitted hidden value $h_i = \mu_{f_1}(x_i)$; they lie on the warping curve by construction and indicate the input regions where the warping is most strongly constrained by data. Right: a fit-quality view of the leaf layer. Black dots are the rewritten observations $(h_i, y_i)$. Green dots are the leaf layer’s actual per-point predictions $\hat y_i = \mu_{f_2}(x_i, h_i)$. Thin grey segments connect each pair: short segments mean the leaf fits well at that point, long segments are residuals. The dashed grey curve is the leaf’s slice $\mu_{f_2}(x_{\mathrm{med}}, h)$ at the median training input — useful for visualising the leaf’s $h$-dependence but it is the green-dots-and-residuals overlay that quantifies fit quality. The annotated residual RMS gives a single-number summary. Triangles on the bottom of each panel mark the inducing locations of that layer (layer 0’s in $x$-space, layer 1’s projected onto the $h$-axis).

This decomposition is genuinely useful in practice, but reading it requires honesty about what we see. The warping is non-monotonic: layer 0’s posterior mean rises sharply through the transition near $x = 0$, peaks around $x = 0.7$, and bends back downward in the right tail. That dip is not a bug; it is a feature of the DGP’s loss landscape. The skip connection means the leaf layer’s input is $[\mathbf{x}, h]$, so the leaf can use $\mathbf{x}$ directly to disambiguate two inputs that share similar $h$ values. This is the disambiguation in action: training points near $x \approx 0.7$ and near $x \approx 2$ both get mapped to $h \approx 1$ by the warping; the leaf separates them using their $\mathbf{x}$-component. This makes the warping non-identifiable: many different layer-0 warpings produce the same predictive distribution at the leaf, because the leaf compensates for whatever shape the warping happens to have. The optimizer found one such warping; a different initialization would likely produce a different but loss-equivalent shape.

The residual RMS in the right panel — around 0.28 — is substantially larger than the 0.05 noise the data was generated with. The visible long grey segments tell the same story: with $M = 8$ inducing points per layer and $N = 30$ training points, the fitter has chosen a conservative variational posterior that does not pin the leaf down. The diagnostic value of this panel is precisely that we can see this is happening; without the per-point prediction overlay we would only see the dashed slice curve passing through the dot clusters and might conclude the fit is fine. More inducing points, more iterations, or more training data would reduce the residuals at the cost of compute.

The triangles at the bottom of each panel mark each layer’s inducing locations. Layer 0’s inducing points are spread across the input range $\mathbf{x}$. Layer 1’s inducing locations live in the augmented input space $[\mathbf{x}, h]$; the markers project them onto the hidden axis $h$ and show that the leaf layer has concentrated its inducing points around the hidden values it actually receives during training, not the wider prior range. This adaptive placement is what the fitter is doing for you while it optimizes the kernel hyperparameters.

A note on identifiability: the parameters of a DGP are not uniquely identified by the data. Many different combinations of $(f_1, f_2)$ can produce the same composition $f_2 \circ f_1$ — the non-monotonic warping above is a concrete example of this. The predictive distribution at the leaf is identified, but the individual layers are only identified up to such re-parameterizations. The plot above shows one valid decomposition, not the unique one. Tutorials that show a tidy monotonic warping are showing a fit that happened to land in a tidy basin; in general one should expect the warping to be one of many loss-equivalent shapes.

When to Use a DGP

A DGP is the right tool when:

The function has non-uniform smoothness: rough in some regions, smooth in others.
The function has discontinuities or sharp transitions that a stationary kernel cannot capture without ringing.
You have enough data for the deep model to find a non-trivial decomposition. Roughly: a 2-layer DGP needs at least a few dozen training points before it outperforms a single SVGP on smooth problems.
You have multi-fidelity data that is naturally hierarchical: low-fidelity observations feeding into a higher-fidelity model. Multi-fidelity DGPs are the subject of the multi-fidelity DGP tutorial.

A DGP is not the right tool when:

The function is genuinely smooth and stationary. A single SVGP will fit it as well or better, with simpler optimization and Gaussian predictives.
You have very few training points. Deep models are harder to identify with sparse data; a single GP with carefully chosen kernel hyperparameters often wins.
You need closed-form Gaussian predictives downstream. The DGP’s non-Gaussian predictive is more flexible but breaks any code that assumes Gaussianity.

Summary

A Deep Gaussian Process composes $L$ SVGPs. Each layer is stationary in its own input space, but the composition learns input-dependent length scales without a non-stationary kernel. Inducing points at every layer keep inference tractable. The variational lower bound has the same shape as for SVGP — a data-fit term minus per-layer KL divergences — but the data-fit term must be estimated by propagating samples through the chain because the joint hidden-layer distribution is non-Gaussian.

Three practical lessons from the figures in this tutorial:

The DGP is more accurate and better calibrated than a single GP on non-stationary problems — but at tutorial scales ($N=30$, $M=8$) the improvement is moderate, not dramatic. The DGP’s main calibration advantage is that it errs on the side of underconfidence rather than overconfidence.
The predictive distribution is generally non-Gaussian, especially in extrapolation, even with a Gaussian likelihood. Reporting only mean and standard deviation throws away the structure that makes the prediction informative; use the full predictive samples for downstream tasks.
The fitted layers are non-identifiable. Many different warpings produce the same predictive distribution because the skip connection lets the leaf layer compensate for any warping shape. A fitted warping is one of many loss-equivalent decompositions, not a unique recovery of underlying structure.

The next tutorial in this sequence applies the chain DGP to multi-fidelity modeling: Multi-fidelity Modeling with Deep Gaussian Processes.

References

Damianou, A. & Lawrence, N. (2013). Deep Gaussian Processes. AISTATS. The original DGP paper.
Salimbeni, H. & Deisenroth, M. (2017). Doubly Stochastic Variational Inference for Deep Gaussian Processes. NeurIPS. The standard reference for modern DGP training; introduces the skip-connection convention used in PyApprox.
Cutajar, K., Pullin, M., Damianou, A., Lawrence, N., & González, J. (2019). Deep Gaussian Processes for Multi-fidelity Modeling. arXiv:1903.07320. Multi-fidelity DGPs and their AR1 reduction — the basis for the multi-fidelity DGP tutorial.

--- title: "Deep Gaussian Processes" subtitle: "PyApprox Tutorial Library" description: "Composing GPs into hierarchical models that learn input-dependent length scales, capture sharp features, and produce non-Gaussian predictive distributions while preserving principled uncertainty." tutorial_type: concept topic: surrogate_modeling difficulty: intermediate estimated_time: 10 prerequisites: - svgp_concept tags: - surrogate - gaussian-process - deep-gp - hierarchical-model - non-stationary format: html: code-fold: false code-tools: true toc: true execute: echo: true warning: false jupyter: python3 --- ::: {.callout-tip collapse="true"} ## Download Notebook [Download as Jupyter Notebook](notebooks/dgp_concept.ipynb) ::: ## Learning Objectives After completing this tutorial, you will be able to: - Explain what a Deep Gaussian Process (DGP) is and how composing GP layers produces non-stationary behavior from stationary ingredients - Identify situations where a single-layer GP is fundamentally limited and a DGP is appropriate - Recognize that the DGP predictive distribution is generally **not** Gaussian, even with a Gaussian likelihood - Describe the role of inducing points at every layer in making DGP inference tractable - Anticipate the inference and identifiability challenges that motivate the doubly-stochastic variational framework treated in later tutorials ## Prerequisites Complete [Sparse Variational Gaussian Processes](svgp_concept.qmd) before this tutorial. The DGP construction reuses the SVGP machinery — inducing points, the variational posterior $q(\mathbf{u})$, the ELBO — at every layer of the deep model. ## Why Compose GPs? The Gaussian process from earlier tutorials uses a **stationary kernel**: the covariance between two function values $f(\mathbf{x})$ and $f(\mathbf{x}')$ depends only on the displacement $\mathbf{x} - \mathbf{x}'$, not on the absolute locations. This gives a single GP its characteristic behavior: one length scale that applies everywhere, and uncertainty bands that close down near data and open up away from it. Stationarity is reasonable for many engineering responses, but it fails on a surprising number of problems. Two diagnostic situations: 1. **Discontinuities or sharp transitions.** A stationary GP cannot resolve a step without ringing — the same length scale that needs to be short at the discontinuity makes the GP overfit smooth regions on either side. 2. **Non-uniform smoothness.** A response that is rough in one region and smooth in another (shock fronts in CFD, phase transitions, regime-switching dynamics) cannot be well-represented by any single length scale. The figure below illustrates the first failure mode and contrasts it with what a two-layer DGP produces on the same data. ```{python} #| echo: false #| label: fig-step-function-comparison #| fig-cap: "Single-layer SVGP (left) versus a two-layer DGP (right) fitted on the same 1D step-function data with the same kernel family and the same Adam → L-BFGS-B fitter — only the depth differs. The single GP smooths the transition with a single length scale and ends up overconfident: its uncertainty band hugs the smooth interpolant tightly, even though the truth (dashed) is a sharp step. The DGP makes a noticeably sharper transition and, importantly, *widens* its uncertainty band in the regions where it has less information — honest behavior the single GP cannot produce. The annotation boxes give test-set RMSE and mean log-likelihood on 200 held-out points; the DGP improves RMSE by 2× and improves log-likelihood by an order of magnitude, the latter dominated by the single GP's mis-calibrated bands." import matplotlib.pyplot as plt from pyapprox_tutorials.figures._dgp import plot_step_function_comparison fig, axes = plt.subplots(1, 2, figsize=(11.5, 4.0)) plot_step_function_comparison(axes) plt.tight_layout() plt.show() ``` The fix is *not* to invent a more elaborate stationary kernel. It is to make the kernel itself **input-dependent** by composing GPs. ## How Good Is the Approximation? The figures so far make a visual case for the DGP, but visual impressions are easy to mislead. Two quantitative checks anchor what we see. First, **point-prediction accuracy**: the small annotation boxes on each panel of the comparison figure above show test-set RMSE and mean log-likelihood on a held-out set of 200 inputs sampled from the same step function. The DGP's RMSE is roughly half the single GP's. The mean log-likelihood gap is much larger, and the difference is informative: log-likelihood rewards a model that gets both the mean *and* the uncertainty right. The single GP's log-likelihood is strongly negative because its bands are too narrow at the discontinuity, so the truth falls outside the predictive distribution and contributes a large penalty. The DGP's log-likelihood is positive because its wider, more honest bands assign reasonable density to the true values. Second, **calibration of predictive intervals**: even a model with low RMSE can produce dishonest uncertainty bands. To check this we ask, at each nominal credible-interval level (say 90%), what fraction of held-out points actually fall inside the predicted band? A perfectly calibrated model gives empirical coverage equal to the nominal level. A model whose curve sits *below* the diagonal is overconfident — its bands are too narrow. A model above the diagonal is underconfident — bands too wide. ```{python} #| echo: false #| label: fig-calibration #| fig-cap: "Reliability diagram for the single-layer SVGP and the two-layer DGP on the step-function held-out test set. The horizontal axis is the nominal credible-interval level; the vertical axis is the empirical fraction of test points falling inside the corresponding band. Points on the diagonal indicate perfect calibration. The single GP sits well below the diagonal across all levels: it is severely overconfident, its bands too narrow to cover the true step at the discontinuity. The DGP sits above the diagonal at low-to-mid nominal levels and approaches it at the highest level. The DGP is underconfident — its bands are wider than they strictly need to be — but never *over*confident, which is the more dangerous failure mode in scientific applications." import matplotlib.pyplot as plt from pyapprox_tutorials.figures._dgp import plot_calibration_curves fig, ax = plt.subplots(1, 1, figsize=(6.5, 5.0)) plot_calibration_curves(ax) plt.tight_layout() plt.show() ``` Neither model sits exactly on the diagonal, but their failure modes are different. The single GP is overconfident: it claims sharp uncertainty bands that don't actually contain the truth at the discontinuity. The DGP is underconfident: it claims wider bands than it strictly needs, but the bands do contain the truth. The two have very different consequences for downstream use. An overconfident model gives you false certainty about wrong predictions — a quietly dangerous failure. An underconfident model gives you honest predictions with conservative uncertainty bounds — less efficient but safer. This trade is a real one. The DGP's wider bands are not a free lunch: in regions where the data is dense and the function is locally simple, the DGP could in principle produce sharper bands. The chained Adam → L-BFGS-B optimizer with $M=8$ inducing points per layer on $N=30$ training points has not pinned the variational posterior down tightly enough to be sharper. More inducing points, more iterations, or more training data would all narrow the gap. For a tutorial-scale fit on a difficult step-function problem, the DGP's behavior is exactly what calibrated probabilistic regression should look like: don't overstate certainty. ## The Construction A two-layer Deep Gaussian Process stacks two GPs: $$ f_1 \sim \mathcal{GP}\!\bigl(0,\, k_1(\mathbf{x}, \mathbf{x}')\bigr), \qquad f_2 \sim \mathcal{GP}\!\bigl(0,\, k_2(\mathbf{h}, \mathbf{h}')\bigr). $$ The output of the first GP becomes the input of the second: $$ y = f_2\!\bigl(f_1(\mathbf{x})\bigr) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0,\sigma^2). $$ Each kernel is stationary in *its own* input space, but the **composition** is not stationary in $\mathbf{x}$. The first layer learns a warping of the input; the second layer is a smooth function on the warped space. Together they produce input-dependent effective length scales without ever calling on a non-stationary kernel. @fig-composition-cartoon decomposes this composition for a concrete example with $f_1(x) = 1.6\tanh(2.5x)$ (a smooth sigmoid) and $f_2(h) = 0.8h + 0.4\sin(2h)$ (a smooth, mildly wavy near-linear function). A smooth $f_1$ (left) maps the input to a warped hidden space. A smooth $f_2$ (middle) is defined on that hidden space. The composition $f_2 \circ f_1$ (right) is sharp where $f_1$ is steep and gentle where $f_1$ is flat — even though both ingredients individually are smooth. This is the mechanism behind the DGP's behavior in @fig-step-function-comparison: layer 0 finds a steep warping near the discontinuity, and the composition with the leaf produces the sharp transition without any non-stationary kernel. ```{python} #| echo: false #| label: fig-composition-cartoon #| fig-cap: "How two stationary GPs compose into a non-stationary effect. Layer 1 (blue) is a smooth warping of the input. Layer 2 (orange) is a smooth function of the warped value. The composition (green) is steep where $f_1$ is steep, producing the sharp transition needed for step-like data without any non-stationary kernel." import matplotlib.pyplot as plt from pyapprox_tutorials.figures._dgp import plot_composition_cartoon fig, axes = plt.subplots(1, 3, figsize=(13.0, 4.0)) plot_composition_cartoon(axes) plt.tight_layout() plt.show() ``` The same construction extends to $L$ layers: $$ f_L \circ f_{L-1} \circ \cdots \circ f_1, \qquad f_\ell \sim \mathcal{GP}\!\bigl(0,\, k_\ell\bigr) \text{ for } \ell = 1, \dots, L. $$ In practice $L=2$ or $L=3$ already captures most of the additional flexibility, and going deeper introduces inference difficulties that outweigh the gains for most surrogate-modeling applications. ## A Practical Refinement: Skip Connections The pure-composition construction has a known pathology: if any intermediate layer learns the identity ($f_\ell(\mathbf{h}) \approx \mathbf{h}$), the model behaves like a shallower DGP. Worse, the optimizer can land at solutions where deeper layers have *less* representational capacity than a single GP because the composition has degenerated. Modern DGPs avoid this by passing the original input forward to every layer: $ \mathbf{h}_\ell = f_\ell\!\bigl([\mathbf{x},\, \mathbf{h}_{\ell-1}]\bigr). $ Each hidden layer sees the original $\mathbf{x}$ concatenated with its parent's output. This **skip connection** preserves the input information regardless of what the hidden layers do, and it makes the reduction to a single GP a clean special case rather than a degenerate basin in parameter space. PyApprox's `build_single_fidelity_dgp` uses skip connections by default. The pure-composition version above is the cleanest framing for *what* a DGP is; skip connections are the framing that makes it *trainable*. ## Inference: Inducing Points at Every Layer The single-layer SVGP from the previous tutorial introduced inducing points $\mathbf{Z}$ as a compressed summary of the GP at $M \ll N$ chosen locations, with a variational distribution $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, S)$ over the function values at those locations. The ELBO bounded the log marginal likelihood from below and was tractable to optimize. A DGP with $L$ layers needs all of this, $L$ times over. Each layer $\ell$ has its own: - inducing locations $\mathbf{Z}_\ell$ in the input space appropriate for that layer, - variational distribution $q(\mathbf{u}_\ell) = \mathcal{N}(\mathbf{m}_\ell, S_\ell)$, - kernel hyperparameters $\boldsymbol\theta_\ell$. The deep ELBO has the same shape as the SVGP ELBO: a data-fit term minus a sum of KL divergences, one per layer: $$ \text{ELBO} \;=\; \sum_{i=1}^{N} \mathbb{E}_{q(f_L(\mathbf{x}_i))}\!\bigl[\log p(y_i \mid f_L(\mathbf{x}_i))\bigr] \;-\; \sum_{\ell=1}^{L} \mathrm{KL}\!\bigl[q(\mathbf{u}_\ell) \,\big\|\, p(\mathbf{u}_\ell)\bigr]. $$ The KL terms have closed-form Gaussian-Gaussian expressions. The data-fit term is the one that gets harder: $f_L(\mathbf{x}_i)$ is the leaf-layer value, but the leaf layer's input is the previous layer's *output*, which is itself a random function of $\mathbf{x}_i$. The expectation marginalizes over the joint posterior of all hidden layers, which is no longer Gaussian. This is where the **doubly-stochastic** machinery comes in: one source of stochasticity is the variational distribution $q(\mathbf{u}_\ell)$ at each layer, the second is the propagation of samples between layers. Practical DGP fitting marginalizes $\mathbf{u}_\ell$ analytically inside each layer, then propagates Monte Carlo (or Gauss--Hermite, or quasi-Monte-Carlo) samples through the chain to estimate the data-fit term. For now the practical takeaway is this: fitting a DGP is **structurally** like fitting $L$ SVGPs simultaneously, with samples from each layer feeding the next. The cost per gradient step is roughly $L$ times the cost of a single SVGP step. Convergence is harder than for a single SVGP because the joint loss is non-convex. ## What the Predictive Distribution Looks Like The single-layer SVGP gives a Gaussian predictive distribution at every test point. The DGP does not. Even with a Gaussian likelihood and Gaussian variational distributions at every layer, the predictive distribution at a new $\mathbf{x}^*$ is: $$ q(f_L(\mathbf{x}^*)) \;=\; \int\! q\bigl(f_L \mid f_{L-1}\bigr)\, q\bigl(f_{L-1} \mid f_{L-2}\bigr) \,\cdots\, q\bigl(f_1 \mid \mathbf{x}^*\bigr) \, df_{L-1} \cdots df_1. $$ Each conditional $q(f_\ell \mid f_{\ell-1})$ is Gaussian, but the marginal of $f_L(\mathbf{x}^*)$ — averaging over all the intermediate randomness — is a mixture and need not be Gaussian. In regions where the layer-1 warping is steep, the marginal can be markedly skewed or heavy-tailed. The figure below makes this concrete. The same fitted DGP is queried at three test locations chosen to land in qualitatively different regions: one in a smooth region away from the warping, one inside the transition region near $x = 0$, and one in extrapolation territory. At each location the histogram of predictive samples is overlaid with the Gaussian whose mean and standard deviation match the samples. In the smooth region and inside the transition the histograms are well-approximated by a Gaussian — a 2-layer DGP fitted to this data has not produced much non-Gaussianity at training inputs. In extrapolation, however, the histogram is strongly skewed: a tall sharp peak near $f(x) = 1$ with a long tail reaching down toward $-1$, completely unlike the matched Gaussian. The non-Gaussian regime is where the DGP's predictive flexibility shows up most clearly. ```{python} #| echo: false #| label: fig-predictive-density #| fig-cap: "Predictive distributions at three test points from the same fitted two-layer DGP. In the smooth region (left, blue) and inside the transition (center, orange) the predictive samples are well-approximated by a Gaussian (dashed black) — the moment-matched Gaussian and the histogram are in close agreement. In extrapolation (right, purple) the histogram is strongly skewed and bimodal-looking: a sharp peak near $+1$ with a long tail reaching down toward $-1$. The Gaussian with the same mean and standard deviation is visibly inadequate. This is the regime where the DGP's predictive distribution genuinely needs the full sample histogram, not just its first two moments." import matplotlib.pyplot as plt from pyapprox_tutorials.figures._dgp import plot_predictive_density fig, axes = plt.subplots(1, 3, figsize=(13.0, 4.0)) plot_predictive_density(axes) plt.tight_layout() plt.show() ``` The extrapolation regime is where the DGP's non-Gaussianity matters most in practice. When the model is asked to predict outside the training data, it is also asked to *propagate* its uncertainty about the warping forward through the chain. That propagation mixes a wide hidden-layer posterior through a non-trivial leaf, producing the kind of skewed predictive seen on the right. Collapsing this to mean and standard deviation throws away the structure that makes the prediction informative — you would conclude "the model is uncertain about everything between $-1$ and $+1$" when in fact the model has put most of its predictive mass on a specific narrow band near $+1$, with a long lower tail accounting for the chance that the warping had a different shape. In code, this means you should not collapse a DGP's predictive distribution to its mean and standard deviation when reporting uncertainty in extrapolation regions. Use the full predictive samples for downstream tasks like uncertainty propagation, calibration analysis, or risk-aware decision making. ## Inspecting What the DGP Learned A fitted DGP is partially interpretable. The first layer's posterior mean $\mu_{f_1}(\mathbf{x})$ is itself a function of the original input — it is the warping that the data found useful. The leaf layer's posterior mean $\mu_{f_2}(\mathbf{h})$ is a smooth function of the hidden value. Plotting them separately reveals the structure the DGP discovered. Note that the left panel of @fig-latent-representation does *not* show the same quantity as @fig-step-function-comparison. Figure 1's vertical axis is $f(x)$, the model's prediction of the observed $y$. The left panel below has $f_1(x)$ on its vertical axis — the *hidden* value the warping assigns to each input. The black dots are the same training inputs $x_i$ from Figure 1, but they are plotted at their fitted hidden values $h_i = \mu_{f_1}(x_i)$, not at their observed $y_i$. A helpful way to read the right panel of the figure below: the deep model has effectively rewritten the training set. The original observations are pairs $(x_i, y_i)$; the leaf layer is performing GP regression not on those, but on the *rewritten* pairs $(h_i, y_i)$, where $h_i = \mu_{f_1}(x_i)$ is the hidden value the warping assigned to training input $x_i$. The black dots in the right panel show this rewritten dataset, and the green dots show the leaf's per-point predictions $\hat y_i = \mu_{f_2}(x_i, h_i)$ at each training input. The thin grey segments connecting black-to-green dots are the residuals, and the annotated RMS gives a single number for fit quality. A good fit shows short segments and a small residual RMS; long segments would indicate the leaf is failing to track the rewritten data — a sign that either the noise term has been inflated to absorb structure or the warping is too crude. The dashed grey curve is the leaf's $\mu_{f_2}(x_{\mathrm{med}}, h)$ slice at the median training input. The slice is useful for *visualizing* what the leaf looks like as a function of $h$ alone, but it is the per-point overlay (green dots + residual segments) that actually measures fit quality, because the leaf's true predictions depend on both $\mathbf{x}$ and $h$ through the skip connection. ```{python} #| echo: false #| label: fig-latent-representation #| fig-cap: "Decomposing what the fitted two-layer DGP learned. Left: the hidden layer's posterior mean as a function of the input — the warping the model discovered. Black dots show each training input $x_i$ at its fitted hidden value $h_i = \\mu_{f_1}(x_i)$; they lie on the warping curve by construction and indicate the input regions where the warping is most strongly constrained by data. Right: a fit-quality view of the leaf layer. Black dots are the rewritten observations $(h_i, y_i)$. Green dots are the leaf layer's actual per-point predictions $\\hat y_i = \\mu_{f_2}(x_i, h_i)$. Thin grey segments connect each pair: short segments mean the leaf fits well at that point, long segments are residuals. The dashed grey curve is the leaf's slice $\\mu_{f_2}(x_{\\mathrm{med}}, h)$ at the median training input — useful for visualising the leaf's $h$-dependence but it is the green-dots-and-residuals overlay that quantifies fit quality. The annotated residual RMS gives a single-number summary. Triangles on the bottom of each panel mark the inducing locations of that layer (layer 0's in $x$-space, layer 1's projected onto the $h$-axis)." import matplotlib.pyplot as plt from pyapprox_tutorials.figures._dgp import plot_latent_representation fig, axes = plt.subplots(1, 2, figsize=(11.5, 4.0)) plot_latent_representation(axes) plt.tight_layout() plt.show() ``` This decomposition is genuinely useful in practice, but reading it requires honesty about what we see. **The warping is non-monotonic**: layer 0's posterior mean rises sharply through the transition near $x = 0$, peaks around $x = 0.7$, and bends back downward in the right tail. That dip is not a bug; it is a feature of the DGP's loss landscape. The skip connection means the leaf layer's input is $[\mathbf{x}, h]$, so the leaf can use $\mathbf{x}$ directly to disambiguate two inputs that share similar $h$ values. This is the disambiguation in action: training points near $x \approx 0.7$ and near $x \approx 2$ both get mapped to $h \approx 1$ by the warping; the leaf separates them using their $\mathbf{x}$-component. This makes the warping *non-identifiable*: many different layer-0 warpings produce the same predictive distribution at the leaf, because the leaf compensates for whatever shape the warping happens to have. The optimizer found one such warping; a different initialization would likely produce a different but loss-equivalent shape. The residual RMS in the right panel — around 0.28 — is substantially larger than the 0.05 noise the data was generated with. The visible long grey segments tell the same story: with $M = 8$ inducing points per layer and $N = 30$ training points, the fitter has chosen a conservative variational posterior that does not pin the leaf down. The diagnostic value of this panel is precisely that we can *see* this is happening; without the per-point prediction overlay we would only see the dashed slice curve passing through the dot clusters and might conclude the fit is fine. More inducing points, more iterations, or more training data would reduce the residuals at the cost of compute. The triangles at the bottom of each panel mark each layer's inducing locations. Layer 0's inducing points are spread across the input range $\mathbf{x}$. Layer 1's inducing locations live in the augmented input space $[\mathbf{x}, h]$; the markers project them onto the hidden axis $h$ and show that the leaf layer has concentrated its inducing points around the hidden values it actually receives during training, not the wider prior range. This adaptive placement is what the fitter is doing for you while it optimizes the kernel hyperparameters. A note on identifiability: the parameters of a DGP are not uniquely identified by the data. Many different combinations of $(f_1, f_2)$ can produce the same composition $f_2 \circ f_1$ — the non-monotonic warping above is a concrete example of this. The *predictive distribution* at the leaf is identified, but the individual layers are only identified up to such re-parameterizations. The plot above shows one valid decomposition, not the unique one. Tutorials that show a tidy monotonic warping are showing a fit that happened to land in a tidy basin; in general one should expect the warping to be one of many loss-equivalent shapes. ## When to Use a DGP A DGP is the right tool when: - The function has **non-uniform smoothness**: rough in some regions, smooth in others. - The function has **discontinuities or sharp transitions** that a stationary kernel cannot capture without ringing. - You have **enough data** for the deep model to find a non-trivial decomposition. Roughly: a 2-layer DGP needs at least a few dozen training points before it outperforms a single SVGP on smooth problems. - You have **multi-fidelity data** that is naturally hierarchical: low-fidelity observations feeding into a higher-fidelity model. Multi-fidelity DGPs are the subject of the [multi-fidelity DGP tutorial](dgp_multifidelity_concept.qmd). A DGP is **not** the right tool when: - The function is genuinely smooth and stationary. A single SVGP will fit it as well or better, with simpler optimization and Gaussian predictives. - You have very few training points. Deep models are harder to identify with sparse data; a single GP with carefully chosen kernel hyperparameters often wins. - You need closed-form Gaussian predictives downstream. The DGP's non-Gaussian predictive is more flexible but breaks any code that assumes Gaussianity. ## Summary A Deep Gaussian Process composes $L$ SVGPs. Each layer is stationary in its own input space, but the composition learns input-dependent length scales without a non-stationary kernel. Inducing points at every layer keep inference tractable. The variational lower bound has the same shape as for SVGP — a data-fit term minus per-layer KL divergences — but the data-fit term must be estimated by propagating samples through the chain because the joint hidden-layer distribution is non-Gaussian. Three practical lessons from the figures in this tutorial: - The DGP is **more accurate and better calibrated than a single GP** on non-stationary problems — but at tutorial scales ($N=30$, $M=8$) the improvement is moderate, not dramatic. The DGP's main calibration advantage is that it errs on the side of underconfidence rather than overconfidence. - The predictive distribution is **generally non-Gaussian, especially in extrapolation**, even with a Gaussian likelihood. Reporting only mean and standard deviation throws away the structure that makes the prediction informative; use the full predictive samples for downstream tasks. - The fitted layers are **non-identifiable**. Many different warpings produce the same predictive distribution because the skip connection lets the leaf layer compensate for any warping shape. A fitted warping is one of many loss-equivalent decompositions, not a unique recovery of underlying structure. The next tutorial in this sequence applies the chain DGP to multi-fidelity modeling: [Multi-fidelity Modeling with Deep Gaussian Processes](dgp_multifidelity_concept.qmd). ## References - Damianou, A. & Lawrence, N. (2013). *Deep Gaussian Processes*. AISTATS. The original DGP paper. - Salimbeni, H. & Deisenroth, M. (2017). *Doubly Stochastic Variational Inference for Deep Gaussian Processes*. NeurIPS. The standard reference for modern DGP training; introduces the skip-connection convention used in PyApprox. - Cutajar, K., Pullin, M., Damianou, A., Lawrence, N., & González, J. (2019). *Deep Gaussian Processes for Multi-fidelity Modeling*. arXiv:1903.07320. Multi-fidelity DGPs and their AR1 reduction — the basis for the multi-fidelity DGP tutorial.