Practical MCMC: Metropolis-Hastings and DRAM

PyApprox Tutorial Library

Tuning proposal distributions, adaptive methods, and convergence diagnostics for MCMC.

Learning Objectives

After completing this tutorial, you will be able to:

  • Diagnose the effect of proposal width on chain mixing: too narrow, too wide, and well-tuned
  • Use acceptance rate as a tuning diagnostic
  • Explain the burn-in period and use autocorrelation to assess effective sample size
  • Describe the Delayed Rejection (DR) mechanism and why it helps stuck chains
  • Describe the Adaptive Metropolis (AM) mechanism and why it helps in correlated posteriors
  • Use DRAM (Delayed Rejection Adaptive Metropolis) as a practical default algorithm

Prerequisites

Complete Sampling the Posterior with MCMC before this tutorial.

Setup

We use the same KLE beam model from the previous tutorial: a cantilever beam whose bending stiffness \(EI(x)\) varies along its length, parameterized by a 2-term KLE expansion with coefficients \(\xi_1, \xi_2 \sim \mathcal{N}(0,1)\). We observe tip deflection and infer the two KLE coefficients.

Proposal Width Matters

The previous tutorial used a well-tuned proposal and the chain worked. But how did we know the right proposal width? Figure 1 shows what happens when the proposal is too narrow, well-tuned, and too wide.

Figure 1: Effect of proposal width on chain behavior. Left: too narrow — the chain takes tiny steps and explores slowly (high acceptance, high autocorrelation). Center: well-tuned — healthy mixing, the chain moves freely through the posterior. Right: too wide — most proposals land in low-density regions and are rejected, causing the chain to stick in place for long stretches.

The patterns:

  • Too narrow (left): acceptance rate is very high (\(>95\%\)) because the proposals are tiny. But the chain barely moves — it takes thousands of steps to traverse the posterior. The 2D scatter shows a tight clump rather than the full distribution.
  • Well-tuned (center): acceptance rate is moderate (\(20\text{--}40\%\)). The chain moves freely, and the 2D scatter fills out the posterior shape.
  • Too wide (right): acceptance rate is very low (\(<5\%\)) because most proposals land far from the current position. The trace plot shows long flat stretches where the chain is stuck. The 2D scatter is sparse and clumpy.
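To make the width effect concrete, here is a minimal random-walk Metropolis sketch. It targets a standard normal density as a stand-in for the beam posterior; the function name `rw_metropolis` and the three widths are illustrative choices, not PyApprox API:

```python
import numpy as np

def rw_metropolis(log_density, x0, width, n_steps, rng):
    """Random-walk Metropolis; returns the chain and its acceptance rate."""
    chain = np.empty(n_steps)
    x, logp = x0, log_density(x0)
    accepted = 0
    for t in range(n_steps):
        xp = x + width * rng.standard_normal()   # symmetric Gaussian proposal
        logp_p = log_density(xp)
        # Metropolis rule: accept with probability min(1, pi(xp)/pi(x))
        if np.log(rng.uniform()) < logp_p - logp:
            x, logp = xp, logp_p
            accepted += 1
        chain[t] = x
    return chain, accepted / n_steps

rng = np.random.default_rng(0)
log_density = lambda x: -0.5 * x**2          # standard normal, up to a constant
for width in (0.05, 2.4, 50.0):              # too narrow, near-optimal, too wide
    _, rate = rw_metropolis(log_density, 0.0, width, 20_000, rng)
    print(f"width={width:5.2f}  acceptance={rate:.2f}")
```

Running this reproduces the qualitative pattern above: the narrow proposal accepts nearly everything, the wide one almost nothing, and the intermediate width sits in the healthy middle range.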

Acceptance Rate as a Diagnostic

The acceptance rate provides a quick diagnostic. Figure 2 sweeps the proposal width and plots the acceptance rate alongside the effective sample size (a measure of how many independent samples the chain produces).

Figure 2: Acceptance rate and effective sample size (ESS) vs. proposal width. The optimal proposal width produces an acceptance rate of roughly \(25\text{--}40\%\) for 2D problems. Too narrow → high acceptance but low ESS. Too wide → low acceptance and low ESS.

The effective sample size peaks near an acceptance rate of \(\approx 25\%\), consistent with theoretical results for random-walk Metropolis in moderate dimensions. This gives a practical rule of thumb: tune the proposal until the acceptance rate is \(20\text{--}40\%\).

Burn-In and Autocorrelation

Burn-In

The chain starts wherever we place it and needs time to find the high-density region. Samples from this transient period are not representative of the posterior and must be discarded. This is the burn-in period.

Figure 3 shows how the choice of burn-in length affects the posterior estimate.

Figure 3: Effect of burn-in removal. The red histogram (all samples) is visibly shifted from the true value because the burn-in transient pulls the distribution toward the starting point. The blue histogram (burn-in removed) centers on the true value. The inset trace plot shows the transient approach phase (gray region).
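The bias from skipping burn-in removal is easy to reproduce. The sketch below uses a standard normal target as a stand-in for the beam posterior and deliberately starts the chain far from the mode; the starting point, width, and burn-in length are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def rw_chain(x0, width, n_steps, rng):
    """Random-walk Metropolis targeting a standard normal,
    started far from the high-density region."""
    x = x0
    out = np.empty(n_steps)
    for t in range(n_steps):
        xp = x + width * rng.standard_normal()
        # log acceptance ratio for the N(0,1) target
        if np.log(rng.uniform()) < 0.5 * (x**2 - xp**2):
            x = xp
        out[t] = x
    return out

chain = rw_chain(x0=50.0, width=0.5, n_steps=5000, rng=rng)
burn = 1000
print(f"mean, all samples     : {chain.mean():+.3f}")        # pulled toward 50
print(f"mean, burn-in removed : {chain[burn:].mean():+.3f}")  # near the true 0
```

The transient samples near the starting point drag the full-chain mean away from the true value, exactly as in the red histogram of Figure 3.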

Autocorrelation

Even after burn-in, consecutive chain samples are correlated — the walker’s position at step \(t\) depends on where it was at step \(t-1\). The autocorrelation function measures how quickly this dependence decays.

Figure 4 compares the autocorrelation for the three proposal widths. A well-tuned chain decorrelates quickly; a poorly tuned chain stays correlated for many steps.

Figure 4: Autocorrelation of the \(\xi_1\) chain for three proposal widths. The well-tuned chain (green) decorrelates within ~20 steps. The too-narrow chain (blue) stays correlated for hundreds of steps. The too-wide chain (red) also decorrelates slowly due to frequent rejections.
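The autocorrelation function and the effective sample size it implies can be estimated with a few lines of NumPy. This is a simplified estimator (truncating the sum at the first negative autocorrelation); production libraries use more careful windowing:

```python
import numpy as np

def autocorr(x, max_lag):
    """Normalized autocorrelation of a 1D chain up to max_lag."""
    x = np.asarray(x, float) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])

def ess(x, max_lag=200):
    """Effective sample size n / tau, where tau = 1 + 2 * sum of
    autocorrelations, truncated at the first negative term."""
    rho = autocorr(x, max_lag)
    tau = 1.0
    for r in rho[1:]:
        if r < 0:
            break
        tau += 2.0 * r
    return len(x) / tau
```

For an independent sequence, `ess` is close to the chain length; for a strongly autocorrelated chain (like the too-narrow case in Figure 4), it is far smaller.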

Delayed Rejection (DR)

Standard Metropolis-Hastings has a binary outcome: accept or reject. If a proposal is rejected, the chain stays at the current position and tries again from scratch. This is wasteful — the rejected proposal told us that the proposed direction might be fine, but the step was too large.

Delayed Rejection adds a second chance: if the first proposal is rejected, try a second proposal with a smaller step size. The acceptance probability for the second stage is adjusted to maintain the correct stationary distribution.

Figure 5 illustrates the mechanism on the 1D KLE beam posterior.

Figure 5: Delayed Rejection in action. Top: the first proposal (large step) lands in a low-density region and is rejected. Bottom: a second proposal (smaller step) is tried from the same current position. It lands in a higher-density region and is accepted. Without DR, the chain would have stayed put.

The key idea: DR gives the chain a second chance with a more conservative step, preventing it from getting stuck after an aggressive proposal is rejected. This is especially useful when the posterior has regions of very different curvature.
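The two-stage step can be sketched as follows. The second-stage acceptance probability uses Mira's formula for symmetric Gaussian proposals; the step sizes `s1` and `s2` are illustrative, and this is not PyApprox's implementation:

```python
import numpy as np

def dr_step(log_post, x, rng, s1=2.0, s2=0.5):
    """One Delayed-Rejection step: an aggressive first stage, then a
    conservative second stage whose acceptance probability is corrected
    to preserve the stationary distribution."""
    def acc(d_logp):                      # min(1, exp(d_logp)), overflow-safe
        return float(np.exp(min(0.0, d_logp)))

    lp_x = log_post(x)
    # Stage 1: large symmetric Gaussian step
    y1 = x + s1 * rng.standard_normal(x.shape)
    lp_y1 = log_post(y1)
    a1 = acc(lp_y1 - lp_x)
    if rng.uniform() < a1:
        return y1
    # Stage 2: smaller step, tried from the same current position
    y2 = x + s2 * rng.standard_normal(x.shape)
    lp_y2 = log_post(y2)
    a1_rev = acc(lp_y1 - lp_y2)           # prob. y2 would accept the move to y1
    q_num = np.exp(-0.5 * np.sum((y1 - y2)**2) / s1**2)  # q1(y2 -> y1)
    q_den = np.exp(-0.5 * np.sum((y1 - x)**2) / s1**2)   # q1(x  -> y1)
    a2 = min(1.0, np.exp(lp_y2 - lp_x) * (q_num / q_den)
             * (1.0 - a1_rev) / (1.0 - a1))
    if rng.uniform() < a2:
        return y2
    return x
```

The correction factor \((1-\alpha_1(y_2,y_1))/(1-\alpha_1(x,y_1))\) accounts for the fact that the second stage is only reached after a rejection, which is what keeps the chain's stationary distribution exact.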

Adaptive Metropolis (AM)

Standard Metropolis-Hastings uses a fixed proposal distribution. But the optimal proposal depends on the posterior’s shape, which we don’t know in advance. Adaptive Metropolis solves this by learning the proposal covariance from the chain history:

\[ \mathbf{C}_{\text{prop}}^{(t)} = s_d \cdot \hat{\boldsymbol{\Sigma}}_t + \varepsilon \mathbf{I} \]

where \(\hat{\boldsymbol{\Sigma}}_t\) is the sample covariance of the chain so far, \(s_d = 2.4^2 / d\) is a scaling factor, and \(\varepsilon \mathbf{I}\) is a small regularization term.
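The covariance update itself is a one-liner; here is a sketch (the function name and jitter value are illustrative, not PyApprox API):

```python
import numpy as np

def am_proposal_cov(chain, eps=1e-6):
    """Adaptive Metropolis proposal covariance: the scaled sample
    covariance of the chain history plus a small regularizing jitter."""
    chain = np.asarray(chain)            # shape (n_samples, d)
    d = chain.shape[1]
    s_d = 2.4**2 / d                     # standard AM scaling factor
    sigma_hat = np.cov(chain, rowvar=False)
    return s_d * sigma_hat + eps * np.eye(d)
```

The \(\varepsilon \mathbf{I}\) term guarantees the proposal covariance stays positive definite even early on, when the sample covariance may be degenerate. In practice the sample covariance is updated recursively rather than recomputed from scratch each step.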

Figure 6 shows the effect on the KLE beam problem. Initially the proposal is isotropic (circular). As the chain runs, the proposal adapts to match the posterior’s elongated shape.

Figure 6: Adaptive Metropolis learns the posterior shape. Left: initial isotropic proposal (circle) — the chain makes many rejected moves perpendicular to the posterior ridge. Right: after adaptation, the proposal aligns with the posterior’s correlation structure (tilted ellipse) — the chain moves efficiently along the ridge.

The adapted proposal ellipse is tilted to align with the posterior’s correlation between \(\xi_1\) and \(\xi_2\). This means the chain proposes moves along the high-density ridge rather than across it, dramatically improving mixing.

DRAM: Combining DR and AM

DRAM (Delayed Rejection Adaptive Metropolis) combines both ideas:

  • Adaptive Metropolis learns the proposal shape from the chain → efficient moves along the posterior ridge
  • Delayed Rejection provides a fallback when an adaptive proposal is rejected → the chain unsticks faster
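Putting the two mechanisms together, a minimal DRAM loop looks roughly like this. It is a simplified sketch, not PyApprox's implementation: the shrink factor, adaptation start, and jitter are illustrative, and the sample covariance is recomputed each step for clarity where a recursive update would be used in practice:

```python
import numpy as np

def dram(log_post, x0, n_steps, adapt_start=200, seed=0):
    """Minimal DRAM sketch: AM covariance learning plus one
    delayed-rejection stage with a shrunken proposal."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    x = np.array(x0, float)
    lp_x = log_post(x)
    C = np.eye(d)                                  # initial isotropic proposal
    chain = np.empty((n_steps, d))
    acc = lambda dlp: float(np.exp(min(0.0, dlp)))
    for t in range(n_steps):
        L = np.linalg.cholesky(C)
        y1 = x + L @ rng.standard_normal(d)        # stage-1 (adapted) proposal
        lp_y1 = log_post(y1)
        a1 = acc(lp_y1 - lp_x)
        if rng.uniform() < a1:
            x, lp_x = y1, lp_y1
        else:                                      # stage-2: shrunken proposal
            y2 = x + 0.2 * (L @ rng.standard_normal(d))
            lp_y2 = log_post(y2)
            a1_rev = acc(lp_y1 - lp_y2)
            Cinv = np.linalg.inv(C)
            q_num = np.exp(-0.5 * (y1 - y2) @ Cinv @ (y1 - y2))
            q_den = np.exp(-0.5 * (y1 - x) @ Cinv @ (y1 - x))
            a2 = min(1.0, np.exp(lp_y2 - lp_x) * (q_num / q_den)
                     * (1.0 - a1_rev) / (1.0 - a1))
            if rng.uniform() < a2:
                x, lp_x = y2, lp_y2
        chain[t] = x
        if t >= adapt_start:                       # AM covariance update
            C = (2.4**2 / d) * np.cov(chain[:t + 1].T) + 1e-8 * np.eye(d)
    return chain
```

On a correlated Gaussian target, the adapted proposal aligns with the ridge and the second stage rescues rejected aggressive moves, which is exactly the combination Figure 7 compares.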

Figure 7 compares the four methods on the KLE beam problem. All chains use the same number of steps and start from the same initial point.

Figure 7: Comparison of four MCMC variants on the KLE beam problem (3,000 steps each). Top row: trace plots of \(\xi_1\). Bottom row: 2D scatter of posterior samples with ESS annotations. Focus on ESS rather than acceptance rate — DR achieves 90% acceptance but the lowest ESS, because most accepted moves are tiny second-stage steps that barely advance the chain.

Several observations:

  • DR achieves the highest acceptance rate (~90%) but the lowest ESS. Its second-stage proposals are small, so most accepted DR moves barely advance the chain. High acceptance rate is misleading here.
  • AM achieves the real improvement by learning the posterior’s correlation structure. Its ESS is dramatically higher than both MH and DR.
  • DRAM combines both mechanisms, matching or exceeding AM’s efficiency on this near-Gaussian posterior while providing a safety net for harder problems where the adapted proposal occasionally proposes too aggressively.

For most problems, DRAM is a good default choice — it is at least as good as AM and more robust on non-Gaussian posteriors.

Practical Guidance

Based on what we’ve seen:

  1. Start with DRAM unless you have a reason to use something simpler.
  2. Check the acceptance rate: aim for \(20\text{--}40\%\) for standard MH and AM in moderate dimensions. DR and DRAM will report higher rates because the second-stage proposals recover first-stage rejections — focus on ESS rather than acceptance rate for these methods.
  3. Inspect trace plots: look for the chain settling into a stationary pattern. Flat stretches indicate the chain is stuck.
  4. Remove burn-in: discard the initial transient. When in doubt, be generous — discarding too many samples is safer than including biased ones.
  5. Check autocorrelation: if the chain is highly correlated, you need a longer run to get enough effective samples.
  6. Run multiple chains from different starting points. If they converge to the same distribution, this is strong evidence that the chains have mixed properly.
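The multiple-chain check in item 6 is usually quantified with the Gelman-Rubin statistic \(\hat{R}\), which compares between-chain and within-chain variance. A basic (non-split) version can be sketched as:

```python
import numpy as np

def gelman_rubin(chains):
    """Basic Gelman-Rubin R-hat for a set of chains of equal length.
    chains: array of shape (m_chains, n_samples). Values near 1 indicate
    the chains agree; > 1.1 is a common warning threshold."""
    chains = np.asarray(chains, float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)
```

If the chains have not mixed — for example, some are stuck in a different region than others — the between-chain variance inflates \(\hat{R}\) well above 1.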
Note: When MCMC is too expensive

Each MCMC step requires one model evaluation. For this 1D cantilever beam with a fast FEM solver, 3,000 evaluations are trivial. But for a 3D FEM model that takes minutes per evaluation, those same few thousand evaluations would be prohibitive. In such cases, replacing the expensive model with a polynomial chaos surrogate inside the MCMC loop can reduce the cost by orders of magnitude.

Key Takeaways

  • Proposal width controls chain behavior: too narrow → slow exploration; too wide → frequent rejection; well-tuned → efficient mixing
  • The acceptance rate is a quick diagnostic: aim for \(20\text{--}40\%\) in moderate dimensions
  • Burn-in must be removed; autocorrelation determines the effective sample size
  • Delayed Rejection (DR) gives the chain a second chance with a smaller step when the first proposal is rejected
  • Adaptive Metropolis (AM) learns the proposal covariance from the chain history, aligning the proposal with the posterior shape
  • DRAM combines both and is a good default for most problems

Exercises

  1. Run standard MH on the KLE beam with acceptance rates of 10%, 25%, and 50% (adjust proposal width to achieve each). Compare the effective sample size. Which is best?

  2. Modify the DR implementation to use three stages instead of two (a third, even smaller proposal). Does the acceptance rate improve? Does the effective sample size change?

  3. In the AM algorithm, the adaptation starts after 200 steps (adaptation_start = 200). What happens if you start adapting immediately (\(t = 1\))? What if you wait until \(t = 1000\)? Why does the choice matter?

  4. (Challenge) Add a second observation (stress measurement in addition to deflection) to the 2D problem. How does the posterior shape change? Does DRAM still outperform standard MH?