Importance Sampling for Rare Events

PyApprox Tutorial Library

When Monte Carlo needs millions of samples to see a single failure, importance sampling deliberately targets the tail and corrects with likelihood ratios.

Download Notebook

Download as Jupyter Notebook

Learning Objectives

After completing this tutorial, you will be able to:

Explain why Monte Carlo is inefficient for small failure probabilities and quantify the cost
Formulate a failure probability as an integral and identify the variance bottleneck
Derive the importance sampling estimator and explain the likelihood ratio correction
Demonstrate empirically that importance sampling reduces variance by orders of magnitude for rare events
Recognize the risk of weight degeneracy from a poor biasing distribution

Prerequisites

Complete Monte Carlo Sampling and Estimator Accuracy and MSE before this tutorial.

Why Rare Events Break Monte Carlo

Estimating the probability of a rare but critical event — a bridge deflecting past its safety limit, a circuit failing under thermal stress — is one of the hardest problems in uncertainty quantification. The difficulty is simple to state: if failures happen one time in ten thousand, then a Monte Carlo experiment with a few hundred samples will almost certainly see zero failures. The estimate $\hat{p} = 0$ is technically unbiased (averaged over infinitely many experiments), but any single experiment is useless.

The QMC tutorial showed that structured point sets accelerate convergence for smooth integrands. But failure probabilities involve the indicator function $\mathbb{1}[f > c]$, which is discontinuous — exactly the case where QMC’s advantage disappears.

We need a fundamentally different idea. Before deriving anything, let’s see the problem.

Seeing the Problem: Where Do the Samples Land?

Consider estimating $p = P(x > 2)$ where $x \sim \mathcal{N}(0,1)$. The true probability is about 2.3%. Figure 1 shows 120 samples drawn from the standard normal, with the failure region $x > 2$ shaded.

Figure 1: 120 samples from the standard normal $\pi$. The failure region $x > 2$ (shaded red) sits in the far tail where $\pi$ has almost no mass. Only 1 sample (red tick mark) lands past the threshold — the other 119 contribute nothing to the failure probability estimate.

The picture tells the whole story: the target density $\pi$ concentrates almost all its mass between $-2$ and $2$. The failure region $x > 2$ sits in the far tail where $\pi$ is negligible — samples simply don’t land there. Of 120 draws, only 1 exceeds the threshold. The other 119 contribute nothing: each returns $\mathbb{1}[x_i > 2] = 0$.

MC Estimation: The Consequence

We just saw that a single MC experiment with 120 samples produced only 1 failure. What happens if we repeat this experiment hundreds of times? Figure 2 runs 500 independent experiments at two sample sizes.

Figure 2: MC estimates of $P(x > 2)$ from 500 independent experiments. Left: $M = 500$ — many experiments see zero failures, producing $\hat{p} = 0$ (the tall spike at the origin). Right: $M = 5{,}000$ — fewer zeros, but estimates still scatter widely around the true value (red dashed line).

The spike at zero in the left panel is exactly what Figure 1 predicted: experiments that drew hundreds of samples without a single one reaching the tail. The estimator is unbiased — averaged over all 500 experiments the mean is close to the true $p$ — but any individual estimate is nearly useless.

We can now quantify the pain. The MC estimator $\hat{p}_M = \frac{1}{M}\sum \mathbb{1}[x_i > c]$ is a sample mean of Bernoulli($p$) random variables. Its variance is $p(1-p)/M \approx p/M$ when $p$ is small, so the relative standard error is:

\[ \frac{\text{SE}[\hat{p}_M]}{p} = \frac{1}{\sqrt{Mp}} \]

To achieve 10% relative error we need $M \approx 100/p$. For $p = 10^{-3}$ that is $100{,}000$ samples; for $p = 10^{-6}$ it is $100$ million. This is the formula behind the spike at zero: with $M = 500$ and $p \approx 0.023$, we need roughly $M \approx 4{,}000$ samples for 10% accuracy, so most 500-sample experiments miss the tail entirely.

What if we could sample from a different distribution — one that puts its mass right where failures occur?

The Key Idea: Sample Where Failures Happen

The fix is to sample from a different distribution $q$ that places mass where failures occur, then correct for the fact that we sampled from the wrong distribution. Figure 3 shows both steps.

Figure 3: The importance sampling fix, continuing the 1D example from Figure 1. **(Left)** A shifted proposal $q$ (blue, centered at 2.5) places most of its mass past the threshold. Of 120 samples, the majority now land in the failure region — compare to the single hit in Figure 1. **(Right)** But we can’t just count them: they were drawn from $q$, not $\pi$. Each sample is placed on the target density $\pi$ and sized by its importance weight $w_i = \pi(x_i) / q(x_i)$. Samples near the threshold carry the largest weights (dark red, large markers) because they are relatively most probable under $\pi$. Samples deep in the tail, where $\pi$ is negligible, are downweighted (yellow, small markers).

The left panel is the “aha”: by shifting the sampling distribution to $q$, the majority of samples now exceed the threshold. But these samples were drawn from $q$, not $\pi$ — we can’t simply count them. The right panel shows the correction: each sample is placed on the target density $\pi$ and scaled by its importance weight $w_i = \pi(x_i)/q(x_i)$. Samples near the threshold are the most valuable — they are relatively probable under $\pi$ and carry the largest weights. Samples deep in the tail, where $q \gg \pi$, contribute almost nothing after reweighting.

This is the entire importance sampling mechanism: sample where it matters, then correct for the bias.

Formalizing the Idea

The importance sampling estimator

The figure above showed the mechanism; now we write it down. The failure probability is an integral:

\[ p = P\bigl(f(\boldsymbol{\theta}) > c\bigr) = \int \mathbb{1}\bigl[f(\boldsymbol{\theta}) > c\bigr]\, \pi(\boldsymbol{\theta})\, d\boldsymbol{\theta} \]

Multiply and divide by the proposal density $q$:

\[ p = \int \mathbb{1}[f(\boldsymbol{\theta}) > c]\, \frac{\pi(\boldsymbol{\theta})}{q(\boldsymbol{\theta})}\, q(\boldsymbol{\theta})\, d\boldsymbol{\theta} \]

Now sample from $q$ instead of $\pi$:

\[ \hat{p}_M^{\text{IS}} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}[f(\boldsymbol{\theta}^{(i)}) > c]\, \frac{\pi(\boldsymbol{\theta}^{(i)})}{q(\boldsymbol{\theta}^{(i)})}, \qquad \boldsymbol{\theta}^{(i)} \sim q \]

The ratio $w(\boldsymbol{\theta}) = \pi(\boldsymbol{\theta}) / q(\boldsymbol{\theta})$ is the likelihood ratio or importance weight. Each term maps directly to something in Figure 3:

$\boldsymbol{\theta}^{(i)} \sim q$ — left panel: samples drawn from the shifted proposal
$\mathbb{1}[f(\boldsymbol{\theta}^{(i)}) > c]$ — the red tick marks: which samples exceed the threshold
$w_i = \pi / q$ — right panel: the marker sizes that correct for the distributional shift

The estimator is unbiased for any valid $q$ (any distribution positive wherever $\pi \cdot \mathbb{1}[f > c] > 0$). The choice of $q$ affects only the variance — and a good choice can reduce it by orders of magnitude.

Setup: The Beam Model with a Failure Threshold

We now apply importance sampling to a realistic engineering model. We use the analytical cantilever beam with $\mathrm{Beta}(2, 5)$ priors on the two Young’s moduli, the same model as the QMC tutorial.

from pyapprox.util.backends.numpy import NumpyBkd
from pyapprox_benchmarks.pde.cantilever_beam_1d_analytical import (
    build_cantilever_beam_1d_analytical,
)

bkd = NumpyBkd()

problem = build_cantilever_beam_1d_analytical(bkd)
model = problem.function()
variable = problem.prior()

print(f"Number of uncertain parameters: {model.nvars()}")
print(f"Number of QoIs: {model.nqoi()}")

Number of uncertain parameters: 2
Number of QoIs: 3

We define “failure” as tip deflection exceeding a threshold $c$. We choose $c$ so that the true failure probability is small but still computable by quadrature — around $p \approx 0.01$ to $0.001$, depending on the model’s output range.

# Choose a threshold in the upper tail of the deflection distribution
# using quadrature-weighted CDF evaluation.
# Search for a threshold that gives true p ≈ 0.005
# by evaluating P(QoI > c) via quadrature at many candidate thresholds.
qoi_np = bkd.to_numpy(qoi_quad)
wts_np = bkd.to_numpy(quad_wts)
candidates = np.linspace(np.percentile(qoi_np, 95), np.max(qoi_np), 200)
probs = np.array([
    float(np.dot(wts_np, (qoi_np > c).astype(float))) for c in candidates
])

# Pick the threshold closest to the target failure probability
target_p = 0.005
threshold = float(candidates[np.argmin(np.abs(probs - target_p))])

# True failure probability via quadrature
indicator = bkd.asarray((qoi_np > threshold).astype(float))
true_p = bkd.to_float(bkd.sum(quad_wts * indicator))

print(f"Failure threshold c:         {threshold:.6f}")
print(f"True failure probability p:  {true_p:.6f}")
print(f"Samples needed for 10% relative SE (MC): {int(100 / true_p):,}")

Failure threshold c:         4.470276
True failure probability p:  0.004884
Samples needed for 10% relative SE (MC): 20,476

Visualizing the Failure Region

Figure 4 shows the beam model’s response surface in $(E_1, E_2)$ input space. The failure region — where the tip deflection exceeds the threshold — is a thin sliver in the corner of the input domain. Most samples drawn from the prior land far from this region and contribute nothing to the failure probability estimate.

Figure 4: The beam’s input space with tip deflection shown as a contour map. The failure region $\delta_{\text{tip}} > c$ (red shading) occupies a small fraction of the prior’s support. Of $M = 2{,}000$ prior samples, only a handful (red diamonds) land in the failure region — the rest (blue dots) contribute zero to the failure probability estimate.

Choosing a Biasing Distribution

For the beam model with bounded Beta priors on $(E_1, E_2)$, a practical strategy is to shift the sampling distribution toward the failure region. If we know approximately where failures occur — for example, from the response surface in Figure 4 — we can center $q$ near the failure boundary.

A simple approach: use a truncated normal (or Beta with adjusted parameters) centered near the region of input space where $f$ is close to the threshold. For this tutorial, we approximate the failure region’s center from the response surface and define $q$ as a product of independent shifted Beta distributions.

# Identify the approximate center of the failure region
# using the prior samples that exceeded the threshold
fail_samples = samples_vis_np[:, fail_mask]
if fail_samples.shape[1] > 0:
    shift_center = np.mean(fail_samples, axis=1)
else:
    # Fallback: use the point with the highest QoI value
    shift_center = samples_vis_np[:, np.argmax(qoi_vis)]

print(f"Approximate failure region center: {shift_center}")

Approximate failure region center: [18161.47802503  4556.64103994]

# Build the biasing distribution q as a product of independent
# truncated normals centered near the failure region, truncated
# to the prior's support bounds.

# Extract prior support bounds from the marginals
marginal_list = variable.marginals()
prior_bounds = [m.bounds() for m in marginal_list]

# Define biasing distribution: truncated normal per dimension
# Spread chosen to cover the failure region without being too narrow
is_spread_factor = 0.15  # fraction of support width

def log_q_pdf(theta, shift_center, bounds, spread_factor):
    """Log-pdf of the biasing distribution (product of truncated normals)."""
    log_pdf = 0.0
    for d in range(len(shift_center)):
        lo, hi = bounds[d]
        sigma = spread_factor * (hi - lo)
        a, b = (lo - shift_center[d]) / sigma, (hi - shift_center[d]) / sigma
        log_pdf += stats.truncnorm.logpdf(
            theta[d], a, b, loc=shift_center[d], scale=sigma
        )
    return log_pdf

def sample_q(M, shift_center, bounds, spread_factor, rng):
    """Draw M samples from the biasing distribution."""
    nvars = len(shift_center)
    samples = np.empty((nvars, M))
    for d in range(nvars):
        lo, hi = bounds[d]
        sigma = spread_factor * (hi - lo)
        a, b = (lo - shift_center[d]) / sigma, (hi - shift_center[d]) / sigma
        samples[d] = stats.truncnorm.rvs(
            a, b, loc=shift_center[d], scale=sigma, size=M,
            random_state=rng,
        )
    return samples

def log_prior_pdf(theta, variable, bkd):
    """Log-pdf of the prior (IndependentJoint) evaluated at theta."""
    # variable.logpdf expects shape (nvars, nsamples) and returns (1, nsamples)
    log_pdf_arr = variable.logpdf(bkd.asarray(theta))
    return bkd.to_numpy(log_pdf_arr).ravel()

Visualizing the Biasing Distribution

Figure 5 compares prior sampling (left) with importance sampling (right). The IS samples cluster in the failure region; the marker sizes show the importance weights that correct for the distributional shift.

Figure 5: Left: $M = 500$ samples from the prior $\pi$ — few reach the failure region. Right: $M = 500$ samples from the biasing distribution $q$ (truncated normal centered near the failure boundary). Many more samples land in the failure region. Marker size is proportional to the importance weight $w_i = \pi(\boldsymbol{\theta}^{(i)}) / q(\boldsymbol{\theta}^{(i)})$; samples far from the prior center receive smaller weights.

IS in Action: Variance Reduction

We now run the same 500-replicate experiment as before, but using importance sampling with $M = 500$ samples per replicate.

def is_failure_estimates(model, variable, threshold, M, n_reps,
                         shift_center, bounds, spread_factor, bkd):
    """Run n_reps IS experiments, each with M samples."""
    estimates = np.empty(n_reps)
    for i in range(n_reps):
        rng = np.random.RandomState(i + 100000)
        s = sample_q(M, shift_center, bounds, spread_factor, rng)
        # Evaluate model using bkd arrays
        s_bkd = bkd.asarray(s)
        q_vals = bkd.to_numpy(model(s_bkd)[0])
        indicator = (q_vals > threshold).astype(float)

        # Importance weights: pi(theta) / q(theta)
        log_w = log_prior_pdf(s, variable, bkd) - log_q_pdf(
            s, shift_center, bounds, spread_factor
        )
        w = np.exp(log_w)
        estimates[i] = np.mean(indicator * w)
    return estimates

is_estimates_500 = is_failure_estimates(
    model, variable, threshold, 500, n_reps,
    shift_center, prior_bounds, is_spread_factor, bkd,
)

Figure 6 places the MC and IS histograms side by side.

Figure 6: Failure probability estimates from 500 experiments with $M = 500$ samples each. Left: standard MC — wide spread with many zero estimates. Right: importance sampling with a shifted truncated normal — dramatically narrower, no zeros, centered on the true probability.

The IS histogram is dramatically narrower and has no zeros. The variance reduction factor quantifies the improvement: IS with $M = 500$ is as precise as MC with many thousands of samples.

Convergence Comparison

Figure 7 shows the relative RMSE of both estimators as a function of $M$. Both converge at rate $O(M^{-1/2})$ — importance sampling does not change the rate. It reduces the constant, which on a log-log plot appears as a vertical shift between two parallel lines.

Figure 7: Relative RMSE of the failure probability estimate vs. sample size. Both MC (blue) and IS (green) converge at $O(M^{-1/2})$ (parallel lines on the log-log plot), but the IS constant is orders of magnitude smaller. IS with $M = 100$ can achieve the same accuracy as MC with $M \approx 50{,}000$.

The Danger of Bad IS: Weight Degeneracy

Importance sampling is not a free lunch. If $q$ is poorly chosen — shifted too far into the tail, or made too narrow — a small number of samples can acquire enormous importance weights while the rest contribute almost nothing. This is weight degeneracy: the effective sample size collapses even though $M$ is large.

The effective sample size diagnoses this:

\[ \text{ESS} = \frac{\bigl(\sum_{i=1}^{M} w_i\bigr)^2}{\sum_{i=1}^{M} w_i^2} \]

When all weights are equal, $\text{ESS} = M$. When one weight dominates, $\text{ESS} \approx 1$. A ratio $\text{ESS}/M \ll 1$ signals that the IS estimate is unreliable.

Figure 8 compares the distribution of log-importance-weights for three choices of the biasing distribution: a well-chosen shift, an overshifted $q$, and an overly narrow $q$.

Figure 8: Histograms of log importance weights $\log w_i$ for three biasing distributions, each with $M = 500$. Left: well-chosen $q$ near the failure boundary — weights are moderate, ESS is high. Center: $q$ shifted too far past the failure region — a few extreme weights dominate, ESS collapses. Right: $q$ with too-small spread — similar degeneracy. The effective sample size (ESS) is shown in each panel.

The well-chosen $q$ produces log-weights that are concentrated in a narrow range — the effective sample size is a healthy fraction of $M$. The overshifted and too-narrow distributions produce log-weight histograms with heavy tails, indicating that a handful of extreme weights dominate the estimator.

Finding the Failure Region in Practice

The IS estimator above assumed we already knew approximately where failures occur. For the 2D beam model, we could read this from the response surface plot. In realistic problems with expensive models and high-dimensional input spaces, this is not straightforward.

Looking ahead: surrogates and optimization

A common practical workflow is to build a cheap surrogate model (e.g., a polynomial chaos expansion or Gaussian process emulator) from a modest pilot sample, and then use the surrogate to identify the failure region. Specifically:

Optimization on the surrogate. Solve $\min_{\boldsymbol{\theta}} -f_{\text{surr}}(\boldsymbol{\theta})$ subject to $\boldsymbol{\theta}$ being in the prior’s support, or equivalently find the most probable failure point (MPP) by minimizing the distance from the prior mode subject to $f_{\text{surr}}(\boldsymbol{\theta}) \ge c$. This is a standard reliability analysis technique (FORM/SORM) and locates the region of input space where failures are most likely under the prior.
Hessian-based proposal covariance. The Hessian of $\log \pi(\boldsymbol{\theta})$ (or of the limit-state function $g = f - c$) at the MPP provides curvature information about the failure boundary. Its inverse gives a natural covariance matrix for the biasing distribution $q$ — a multivariate normal centered at the MPP with covariance proportional to the inverse Hessian. This adapts the shape and orientation of $q$ to the local geometry of the failure surface, without requiring manual tuning.
Iterative refinement. Run IS with the surrogate-based $q$, use the resulting weighted samples to update the surrogate in the failure region, and re-optimize. This focuses surrogate accuracy where it matters most for the failure probability estimate.

These techniques make IS practical for expensive, high-dimensional models where visual inspection of the failure region is impossible. They are the subject of a later tutorial.

Exercises

Vary the threshold $c$ to produce true failure probabilities ranging from $10^{-2}$ to $10^{-4}$. How does the IS variance reduction factor (ratio of MC variance to IS variance at fixed $M$) change as $p$ decreases? Does IS become more or less valuable for rarer events?
For the beam model, use a pilot MC run of $M = 200$ to estimate the center of the failure region (mean of the failing samples). Use this as shift_center for IS. Compare the resulting variance reduction to using an arbitrary shift. How sensitive is the result to the accuracy of the center estimate?
Implement the effective sample size $\text{ESS} = (\sum w_i)^2 / \sum w_i^2$ and monitor it as you vary the is_spread_factor from 0.05 to 0.5. Plot $\text{ESS}/M$ vs. spread factor. Where does ESS collapse? What is the practical “sweet spot”?
(Challenge) Combine importance sampling with QMC: use a Sobol sequence in $[0,1]^2$ and transform it through $q$’s inverse CDF (instead of the prior’s). Compare the variance of this QMC-IS estimator to both plain IS and plain MC. Does the QMC structure help even though the integrand contains a discontinuous indicator?

Key Takeaways

MC needs $M \approx 100/p$ samples for 10% relative error on a failure probability $p$ — prohibitive for rare events
QMC cannot help much: the indicator integrand is discontinuous, so the Koksma–Hlawka advantage disappears
Importance sampling reweights samples from a biasing distribution $q$ that concentrates mass in the failure region, correcting via the likelihood ratio $\pi/q$
IS does not change the convergence rate ($1/\sqrt{M}$) but can reduce the variance constant by orders of magnitude
Poor choice of $q$ leads to weight degeneracy — a few extreme weights dominate and the estimate becomes unreliable; the effective sample size ESS diagnoses this
In practice, surrogates and optimization (FORM/SORM, Hessian-based proposals) provide systematic ways to construct good biasing distributions for expensive, high-dimensional models

Next Steps

Continue with:

Control Variates — Using correlated auxiliary quantities to reduce estimator variance

--- title: "Importance Sampling for Rare Events" subtitle: "PyApprox Tutorial Library" description: "When Monte Carlo needs millions of samples to see a single failure, importance sampling deliberately targets the tail and corrects with likelihood ratios." tutorial_type: concept topic: uncertainty_quantification difficulty: intermediate estimated_time: 12 render_time: 30 prerequisites: - monte_carlo_sampling - estimator_accuracy_mse tags: - uq - monte-carlo - importance-sampling - rare-events - failure-probability - variance-reduction format: html: code-fold: false code-tools: true toc: true execute: echo: true warning: false jupyter: python3 --- ::: {.callout-tip collapse="true"} ## Download Notebook [Download as Jupyter Notebook](notebooks/importance_sampling_concept.ipynb) ::: ## Learning Objectives After completing this tutorial, you will be able to: - Explain why Monte Carlo is **inefficient for small failure probabilities** and quantify the cost - Formulate a failure probability as an integral and identify the variance bottleneck - Derive the **importance sampling estimator** and explain the likelihood ratio correction - Demonstrate empirically that importance sampling reduces variance by **orders of magnitude** for rare events - Recognize the risk of **weight degeneracy** from a poor biasing distribution ## Prerequisites Complete [Monte Carlo Sampling](monte_carlo_sampling.qmd) and [Estimator Accuracy and MSE](estimator_accuracy_mse.qmd) before this tutorial. ## Why Rare Events Break Monte Carlo Estimating the probability of a rare but critical event --- a bridge deflecting past its safety limit, a circuit failing under thermal stress --- is one of the hardest problems in uncertainty quantification. The difficulty is simple to state: if failures happen one time in ten thousand, then a Monte Carlo experiment with a few hundred samples will almost certainly see **zero failures**. The estimate $\hat{p} = 0$ is technically unbiased (averaged over infinitely many experiments), but any single experiment is useless. The [QMC tutorial](quasi_monte_carlo_sampling.qmd) showed that structured point sets accelerate convergence for smooth integrands. But failure probabilities involve the indicator function $\mathbb{1}[f > c]$, which is discontinuous --- exactly the case where QMC's advantage disappears. We need a fundamentally different idea. Before deriving anything, let's see the problem. ## Seeing the Problem: Where Do the Samples Land? Consider estimating $p = P(x > 2)$ where $x \sim \mathcal{N}(0,1)$. The true probability is about 2.3%. @fig-mc-samples-1d shows 120 samples drawn from the standard normal, with the failure region $x > 2$ shaded. ```{python} #| echo: false import numpy as np import matplotlib.pyplot as plt from scipy import stats from scipy.stats import norm as sp_norm threshold_1d = 2.0 true_p_1d = sp_norm.sf(threshold_1d) np.random.seed(42) n_vis = 120 mc_vis = np.random.normal(0, 1, n_vis) mc_fail_vis = mc_vis > threshold_1d x_plot = np.linspace(-4, 6, 1000) p_pdf = sp_norm.pdf(x_plot, 0, 1) ``` ```{python} #| echo: false #| fig-cap: "120 samples from the standard normal $\\pi$. The failure region $x > 2$ (shaded red) sits in the far tail where $\\pi$ has almost no mass. Only 1 sample (red tick mark) lands past the threshold --- the other 119 contribute nothing to the failure probability estimate." #| label: fig-mc-samples-1d fig, ax = plt.subplots(figsize=(8, 4)) ax.fill_between(x_plot, p_pdf, where=(x_plot >= threshold_1d), alpha=0.18, color="#e74c3c", zorder=1) ax.plot(x_plot, p_pdf, color="#c0392b", lw=2.5, zorder=3, label=r"Target $\pi(x)$ (std. normal)") ax.axvline(threshold_1d, color="#e74c3c", lw=1.8, ls="--", alpha=0.7, zorder=2) y_tick = -0.015 ax.scatter(mc_vis[~mc_fail_vis], np.full((~mc_fail_vis).sum(), y_tick), marker="|", s=50, color="#95a5a6", alpha=0.5, linewidths=1, zorder=4) ax.scatter(mc_vis[mc_fail_vis], np.full(mc_fail_vis.sum(), y_tick), marker="|", s=90, color="#e74c3c", linewidths=2, zorder=5) n_hits = mc_fail_vis.sum() ax.text(0.96, 0.92, f"$p = P(x > {threshold_1d}) = {true_p_1d:.3f}$\n" f"Failures: {n_hits} / {n_vis}", transform=ax.transAxes, ha="right", va="top", fontsize=10, bbox=dict(boxstyle="round,pad=0.4", fc="white", ec="#bdc3c7")) ax.annotate( "Rare event region\n(most samples miss it)", xy=(2.8, 0.008), xytext=(3.8, 0.15), fontsize=9, color="#e74c3c", arrowprops=dict(arrowstyle="-|>", color="#e74c3c", lw=1.5), bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="#e74c3c", alpha=0.9), ) ax.set_xlabel("$x$", fontsize=11) ax.set_ylabel("Density", fontsize=11) ax.set_xlim(-4, 6) ax.set_ylim(-0.035, 0.52) ax.legend(fontsize=9, loc="upper left", framealpha=0.9) ax.grid(True, alpha=0.12) plt.tight_layout() plt.show() ``` The picture tells the whole story: the target density $\pi$ concentrates almost all its mass between $-2$ and $2$. The failure region $x > 2$ sits in the far tail where $\pi$ is negligible --- samples simply don't land there. Of 120 draws, only 1 exceeds the threshold. The other 119 contribute nothing: each returns $\mathbb{1}[x_i > 2] = 0$. ## MC Estimation: The Consequence We just saw that a single MC experiment with 120 samples produced only 1 failure. What happens if we repeat this experiment hundreds of times? @fig-mc-failure-histogram runs 500 independent experiments at two sample sizes. ```{python} #| echo: false n_reps = 500 def mc_failure_estimates_1d(threshold, M, n_reps): """Run n_reps MC experiments for P(x > threshold), x ~ N(0,1).""" estimates = np.empty(n_reps) for i in range(n_reps): np.random.seed(i) samples = np.random.normal(0, 1, M) estimates[i] = np.mean(samples > threshold) return estimates estimates_500_1d = mc_failure_estimates_1d(threshold_1d, 500, n_reps) estimates_5000_1d = mc_failure_estimates_1d(threshold_1d, 5000, n_reps) ``` ```{python} #| echo: false #| fig-cap: "MC estimates of $P(x > 2)$ from 500 independent experiments. Left: $M = 500$ --- many experiments see zero failures, producing $\\hat{p} = 0$ (the tall spike at the origin). Right: $M = 5{,}000$ --- fewer zeros, but estimates still scatter widely around the true value (red dashed line)." #| label: fig-mc-failure-histogram fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4.5), sharey=False) for ax, ests, M_lab in [(ax1, estimates_500_1d, 500), (ax2, estimates_5000_1d, 5000)]: ax.hist(ests, bins=40, density=True, alpha=0.7, color="steelblue", edgecolor="white", linewidth=0.5) ax.axvline(true_p_1d, color="firebrick", linestyle="--", linewidth=2, label=f"True $p = {true_p_1d:.4f}$") n_zeros = np.sum(ests == 0) ax.set_title(f"MC, $M = {M_lab:,}$", fontsize=12) ax.set_xlabel(r"$\hat{p}_M$") ax.legend(fontsize=9) ax.text(0.97, 0.80, f"$\\hat{{p}} = 0$ in {n_zeros}/{n_reps}\n" f"({100 * n_zeros / n_reps:.0f}%) of experiments", transform=ax.transAxes, ha="right", fontsize=9, bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.85)) ax1.set_ylabel("Density") plt.tight_layout() plt.show() ``` The spike at zero in the left panel is exactly what @fig-mc-samples-1d predicted: experiments that drew hundreds of samples without a single one reaching the tail. The estimator is unbiased --- averaged over all 500 experiments the mean is close to the true $p$ --- but any individual estimate is nearly useless. We can now quantify the pain. The MC estimator $\hat{p}_M = \frac{1}{M}\sum \mathbb{1}[x_i > c]$ is a sample mean of Bernoulli($p$) random variables. Its variance is $p(1-p)/M \approx p/M$ when $p$ is small, so the relative standard error is: $$ \frac{\text{SE}[\hat{p}_M]}{p} = \frac{1}{\sqrt{Mp}} $$ To achieve 10% relative error we need $M \approx 100/p$. For $p = 10^{-3}$ that is $100{,}000$ samples; for $p = 10^{-6}$ it is $100$ million. This is the formula behind the spike at zero: with $M = 500$ and $p \approx 0.023$, we need roughly $M \approx 4{,}000$ samples for 10% accuracy, so most 500-sample experiments miss the tail entirely. What if we could sample from a *different* distribution --- one that puts its mass right where failures occur? ## The Key Idea: Sample Where Failures Happen The fix is to **sample from a different distribution $q$ that places mass where failures occur**, then correct for the fact that we sampled from the wrong distribution. @fig-is-fix shows both steps. ```{python} #| echo: false #| fig-cap: "The importance sampling fix, continuing the 1D example from @fig-mc-samples-1d. **(Left)** A shifted proposal $q$ (blue, centered at 2.5) places most of its mass past the threshold. Of 120 samples, the majority now land in the failure region --- compare to the single hit in @fig-mc-samples-1d. **(Right)** But we can't just count them: they were drawn from $q$, not $\\pi$. Each sample is placed on the target density $\\pi$ and sized by its importance weight $w_i = \\pi(x_i) / q(x_i)$. Samples near the threshold carry the largest weights (dark red, large markers) because they are relatively most probable under $\\pi$. Samples deep in the tail, where $\\pi$ is negligible, are downweighted (yellow, small markers)." #| label: fig-is-fix np.random.seed(42) n_vis = 120 q_mean = 2.5 q_std = 0.8 q_pdf = sp_norm.pdf(x_plot, q_mean, q_std) is_vis = np.random.normal(q_mean, q_std, n_vis) is_fail_vis = is_vis > threshold_1d weights_vis = sp_norm.pdf(is_vis, 0, 1) / sp_norm.pdf(is_vis, q_mean, q_std) fig, (ax_b, ax_c) = plt.subplots(1, 2, figsize=(13, 4.5)) # --- Left panel: proposal q with samples --- ax_b.fill_between(x_plot, q_pdf, where=(x_plot >= threshold_1d), alpha=0.12, color="#2471a3", zorder=1) ax_b.plot(x_plot, p_pdf, color="#c0392b", lw=1.5, alpha=0.35, zorder=2, label=r"Target $\pi(x)$") ax_b.plot(x_plot, q_pdf, color="#2471a3", lw=2.5, zorder=3, label=r"Proposal $q(x)$") ax_b.axvline(threshold_1d, color="#e74c3c", lw=1.8, ls="--", alpha=0.7, zorder=2) y_tick = -0.015 ax_b.scatter(is_vis[~is_fail_vis], np.full((~is_fail_vis).sum(), y_tick), marker="|", s=50, color="#2471a3", alpha=0.4, linewidths=1, zorder=4) ax_b.scatter(is_vis[is_fail_vis], np.full(is_fail_vis.sum(), y_tick), marker="|", s=90, color="#e74c3c", linewidths=2, zorder=5) n_is_hits = is_fail_vis.sum() ax_b.text(0.96, 0.92, f"Failures: {n_is_hits} / {n_vis}", transform=ax_b.transAxes, ha="right", va="top", fontsize=10, bbox=dict(boxstyle="round,pad=0.4", fc="white", ec="#bdc3c7")) ax_b.annotate( "$q$ concentrates mass\nwhere failures occur", xy=(2.5, 0.42), xytext=(0.0, 0.42), fontsize=9, color="#2471a3", arrowprops=dict(arrowstyle="-|>", color="#2471a3", lw=1.5), bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="#2471a3", alpha=0.9), ) ax_b.set_title("Shifted proposal $q$ — many more hits", fontsize=12, fontweight="bold") ax_b.set_xlabel("$x$", fontsize=11) ax_b.set_ylabel("Density", fontsize=11) ax_b.set_xlim(-4, 6) ax_b.set_ylim(-0.035, 0.52) ax_b.legend(fontsize=9, loc="upper left", framealpha=0.9) ax_b.grid(True, alpha=0.12) # --- Right panel: weighted samples on pi --- ax_c.fill_between(x_plot, p_pdf, where=(x_plot >= threshold_1d), alpha=0.12, color="#e74c3c", zorder=1) ax_c.plot(x_plot, p_pdf, color="#c0392b", lw=2, alpha=0.5, zorder=2) ax_c.axvline(threshold_1d, color="#e74c3c", lw=1.5, ls="--", alpha=0.5, zorder=2) # Safe IS samples: small gray ax_c.scatter(is_vis[~is_fail_vis], sp_norm.pdf(is_vis[~is_fail_vis], 0, 1), s=12, alpha=0.25, color="#95a5a6", edgecolors="none", zorder=4) # Failure IS samples: sized by weight w_fail = weights_vis[is_fail_vis] w_sizes = 20 + 250 * (w_fail / w_fail.max()) sc = ax_c.scatter(is_vis[is_fail_vis], sp_norm.pdf(is_vis[is_fail_vis], 0, 1), s=w_sizes, alpha=0.7, c=w_fail, cmap="YlOrRd", edgecolors="white", linewidth=0.5, zorder=5) cbar = fig.colorbar(sc, ax=ax_c, shrink=0.65, pad=0.03, aspect=15) cbar.set_label(r"Weight $w_i = \pi / q$", fontsize=10) ax_c.annotate( "Samples near threshold\nhave largest weights\n(most likely under $\\pi$)", xy=(2.1, sp_norm.pdf(2.1, 0, 1)), xytext=(-3.0, 0.18), fontsize=8.5, color="#7d3c1f", arrowprops=dict(arrowstyle="-|>", color="#7d3c1f", lw=1.3, connectionstyle="arc3,rad=0.2"), bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="#7d3c1f", alpha=0.9), ) ax_c.set_title("Weights correct for the shift", fontsize=12, fontweight="bold") ax_c.set_xlabel("$x$", fontsize=11) ax_c.set_ylabel(r"$\pi(x)$", fontsize=11) ax_c.set_xlim(-4, 5.5) ax_c.set_ylim(-0.01, 0.44) ax_c.grid(True, alpha=0.12) plt.tight_layout() plt.show() ``` The left panel is the "aha": by shifting the sampling distribution to $q$, the majority of samples now exceed the threshold. But these samples were drawn from $q$, not $\pi$ --- we can't simply count them. The right panel shows the correction: each sample is placed on the target density $\pi$ and scaled by its importance weight $w_i = \pi(x_i)/q(x_i)$. Samples near the threshold are the most valuable --- they are relatively probable under $\pi$ and carry the largest weights. Samples deep in the tail, where $q \gg \pi$, contribute almost nothing after reweighting. This is the entire importance sampling mechanism: **sample where it matters, then correct for the bias**. ## Formalizing the Idea ### The importance sampling estimator The figure above showed the mechanism; now we write it down. The failure probability is an integral: $$ p = P\bigl(f(\boldsymbol{\theta}) > c\bigr) = \int \mathbb{1}\bigl[f(\boldsymbol{\theta}) > c\bigr]\, \pi(\boldsymbol{\theta})\, d\boldsymbol{\theta} $$ Multiply and divide by the proposal density $q$: $$ p = \int \mathbb{1}[f(\boldsymbol{\theta}) > c]\, \frac{\pi(\boldsymbol{\theta})}{q(\boldsymbol{\theta})}\, q(\boldsymbol{\theta})\, d\boldsymbol{\theta} $$ Now sample from $q$ instead of $\pi$: $$ \hat{p}_M^{\text{IS}} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}[f(\boldsymbol{\theta}^{(i)}) > c]\, \frac{\pi(\boldsymbol{\theta}^{(i)})}{q(\boldsymbol{\theta}^{(i)})}, \qquad \boldsymbol{\theta}^{(i)} \sim q $$ The ratio $w(\boldsymbol{\theta}) = \pi(\boldsymbol{\theta}) / q(\boldsymbol{\theta})$ is the **likelihood ratio** or **importance weight**. Each term maps directly to something in @fig-is-fix: - $\boldsymbol{\theta}^{(i)} \sim q$ --- left panel: samples drawn from the shifted proposal - $\mathbb{1}[f(\boldsymbol{\theta}^{(i)}) > c]$ --- the red tick marks: which samples exceed the threshold - $w_i = \pi / q$ --- right panel: the marker sizes that correct for the distributional shift The estimator is **unbiased** for any valid $q$ (any distribution positive wherever $\pi \cdot \mathbb{1}[f > c] > 0$). The choice of $q$ affects only the **variance** --- and a good choice can reduce it by orders of magnitude. ## Setup: The Beam Model with a Failure Threshold We now apply importance sampling to a realistic engineering model. We use the analytical cantilever beam with $\mathrm{Beta}(2, 5)$ priors on the two Young's moduli, the same model as the [QMC tutorial](quasi_monte_carlo_sampling.qmd). ```{python} from pyapprox.util.backends.numpy import NumpyBkd from pyapprox_benchmarks.pde.cantilever_beam_1d_analytical import ( build_cantilever_beam_1d_analytical, ) bkd = NumpyBkd() problem = build_cantilever_beam_1d_analytical(bkd) model = problem.function() variable = problem.prior() print(f"Number of uncertain parameters: {model.nvars()}") print(f"Number of QoIs: {model.nqoi()}") ``` We define "failure" as tip deflection exceeding a threshold $c$. We choose $c$ so that the true failure probability is small but still computable by quadrature --- around $p \approx 0.01$ to $0.001$, depending on the model's output range. ```{python} #| echo: false from pyapprox.surrogates.quadrature import ( TensorProductQuadratureRule, gauss_quadrature_rule, ) nquad = 40 marginals = variable.marginals() quad_rules = [lambda n, m=m: gauss_quadrature_rule(m, n, bkd) for m in marginals] quad_rule = TensorProductQuadratureRule(bkd, quad_rules, [nquad] * len(marginals)) quad_pts, quad_wts = quad_rule() qoi_quad = model(quad_pts)[0] # tip deflection ``` ```{python} # Choose a threshold in the upper tail of the deflection distribution # using quadrature-weighted CDF evaluation. # Search for a threshold that gives true p ≈ 0.005 # by evaluating P(QoI > c) via quadrature at many candidate thresholds. qoi_np = bkd.to_numpy(qoi_quad) wts_np = bkd.to_numpy(quad_wts) candidates = np.linspace(np.percentile(qoi_np, 95), np.max(qoi_np), 200) probs = np.array([ float(np.dot(wts_np, (qoi_np > c).astype(float))) for c in candidates ]) # Pick the threshold closest to the target failure probability target_p = 0.005 threshold = float(candidates[np.argmin(np.abs(probs - target_p))]) # True failure probability via quadrature indicator = bkd.asarray((qoi_np > threshold).astype(float)) true_p = bkd.to_float(bkd.sum(quad_wts * indicator)) print(f"Failure threshold c: {threshold:.6f}") print(f"True failure probability p: {true_p:.6f}") print(f"Samples needed for 10% relative SE (MC): {int(100 / true_p):,}") ``` ## Visualizing the Failure Region @fig-failure-region shows the beam model's response surface in $(E_1, E_2)$ input space. The failure region --- where the tip deflection exceeds the threshold --- is a thin sliver in the corner of the input domain. Most samples drawn from the prior land far from this region and contribute nothing to the failure probability estimate. ```{python} #| echo: false #| fig-cap: "The beam's input space with tip deflection shown as a contour map. The failure region $\\delta_{\\text{tip}} > c$ (red shading) occupies a small fraction of the prior's support. Of $M = 2{,}000$ prior samples, only a handful (red diamonds) land in the failure region --- the rest (blue dots) contribute zero to the failure probability estimate." #| label: fig-failure-region # Evaluate model on a grid for the contour plot n_grid = 100 marginal_list = variable.marginals() # Extract bounds directly from the marginals x1_lo, x1_hi = marginal_list[0].bounds() x2_lo, x2_hi = marginal_list[1].bounds() x1_grid = np.linspace(x1_lo, x1_hi, n_grid) x2_grid = np.linspace(x2_lo, x2_hi, n_grid) X1, X2 = np.meshgrid(x1_grid, x2_grid) grid_pts = bkd.asarray(np.vstack([X1.ravel(), X2.ravel()])) grid_qoi = bkd.to_numpy(model(grid_pts)[0]).reshape(n_grid, n_grid) # Draw prior samples and classify np.random.seed(42) M_vis = 2000 samples_vis = variable.rvs(M_vis) qoi_vis = bkd.to_numpy(model(samples_vis)[0]) fail_mask = qoi_vis > threshold fig, ax = plt.subplots(figsize=(8, 6)) # Contour of tip deflection cf = ax.contourf(X1, X2, grid_qoi, levels=30, cmap="YlOrRd", alpha=0.6) ax.contour(X1, X2, grid_qoi, levels=[threshold], colors="firebrick", linewidths=2.5, linestyles="--") # Shade failure region ax.contourf(X1, X2, grid_qoi, levels=[threshold, grid_qoi.max()], colors=["firebrick"], alpha=0.15) # Prior samples samples_vis_np = bkd.to_numpy(samples_vis) ax.scatter(samples_vis_np[0, ~fail_mask], samples_vis_np[1, ~fail_mask], s=6, alpha=0.3, color="steelblue", edgecolors="none", label=f"Safe ({int((~fail_mask).sum())})") ax.scatter(samples_vis_np[0, fail_mask], samples_vis_np[1, fail_mask], s=30, alpha=0.9, color="firebrick", edgecolors="white", linewidth=0.5, marker="D", label=f"Failure ({int(fail_mask.sum())})") ax.set_xlabel(r"$E_1$", fontsize=12) ax.set_ylabel(r"$E_2$", fontsize=12) ax.set_title(r"Failure region: $\delta_{\mathrm{tip}} > c$", fontsize=13) ax.legend(fontsize=10, loc="upper right") cbar = fig.colorbar(cf, ax=ax, label=r"$\delta_{\mathrm{tip}}$") plt.tight_layout() plt.show() ``` ## Choosing a Biasing Distribution For the beam model with bounded Beta priors on $(E_1, E_2)$, a practical strategy is to **shift** the sampling distribution toward the failure region. If we know approximately where failures occur --- for example, from the response surface in @fig-failure-region --- we can center $q$ near the failure boundary. A simple approach: use a truncated normal (or Beta with adjusted parameters) centered near the region of input space where $f$ is close to the threshold. For this tutorial, we approximate the failure region's center from the response surface and define $q$ as a product of independent shifted Beta distributions. ```{python} # Identify the approximate center of the failure region # using the prior samples that exceeded the threshold fail_samples = samples_vis_np[:, fail_mask] if fail_samples.shape[1] > 0: shift_center = np.mean(fail_samples, axis=1) else: # Fallback: use the point with the highest QoI value shift_center = samples_vis_np[:, np.argmax(qoi_vis)] print(f"Approximate failure region center: {shift_center}") ``` ```{python} # Build the biasing distribution q as a product of independent # truncated normals centered near the failure region, truncated # to the prior's support bounds. # Extract prior support bounds from the marginals marginal_list = variable.marginals() prior_bounds = [m.bounds() for m in marginal_list] # Define biasing distribution: truncated normal per dimension # Spread chosen to cover the failure region without being too narrow is_spread_factor = 0.15 # fraction of support width def log_q_pdf(theta, shift_center, bounds, spread_factor): """Log-pdf of the biasing distribution (product of truncated normals).""" log_pdf = 0.0 for d in range(len(shift_center)): lo, hi = bounds[d] sigma = spread_factor * (hi - lo) a, b = (lo - shift_center[d]) / sigma, (hi - shift_center[d]) / sigma log_pdf += stats.truncnorm.logpdf( theta[d], a, b, loc=shift_center[d], scale=sigma ) return log_pdf def sample_q(M, shift_center, bounds, spread_factor, rng): """Draw M samples from the biasing distribution.""" nvars = len(shift_center) samples = np.empty((nvars, M)) for d in range(nvars): lo, hi = bounds[d] sigma = spread_factor * (hi - lo) a, b = (lo - shift_center[d]) / sigma, (hi - shift_center[d]) / sigma samples[d] = stats.truncnorm.rvs( a, b, loc=shift_center[d], scale=sigma, size=M, random_state=rng, ) return samples def log_prior_pdf(theta, variable, bkd): """Log-pdf of the prior (IndependentJoint) evaluated at theta.""" # variable.logpdf expects shape (nvars, nsamples) and returns (1, nsamples) log_pdf_arr = variable.logpdf(bkd.asarray(theta)) return bkd.to_numpy(log_pdf_arr).ravel() ``` ## Visualizing the Biasing Distribution @fig-is-biasing-visual compares prior sampling (left) with importance sampling (right). The IS samples cluster in the failure region; the marker sizes show the importance weights that correct for the distributional shift. ```{python} #| echo: false #| fig-cap: "Left: $M = 500$ samples from the prior $\\pi$ --- few reach the failure region. Right: $M = 500$ samples from the biasing distribution $q$ (truncated normal centered near the failure boundary). Many more samples land in the failure region. Marker size is proportional to the importance weight $w_i = \\pi(\\boldsymbol{\\theta}^{(i)}) / q(\\boldsymbol{\\theta}^{(i)})$; samples far from the prior center receive smaller weights." #| label: fig-is-biasing-visual M_is_vis = 500 # Prior samples np.random.seed(99) s_prior = bkd.to_numpy(variable.rvs(M_is_vis)) q_prior = bkd.to_numpy(model(bkd.asarray(s_prior))[0]) fail_prior = q_prior > threshold # IS samples rng_is = np.random.RandomState(99) s_is = sample_q(M_is_vis, shift_center, prior_bounds, is_spread_factor, rng_is) q_is_vals = bkd.to_numpy(model(bkd.asarray(s_is))[0]) fail_is = q_is_vals > threshold # Compute importance weights log_w = log_prior_pdf(s_is, variable, bkd) - log_q_pdf( s_is, shift_center, prior_bounds, is_spread_factor ) w = np.exp(log_w - np.max(log_w)) # stabilize for visualization w_sizes = 5 + 80 * (w / w.max()) # map to marker sizes fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5.5)) # Left: prior samples ax1.contourf(X1, X2, grid_qoi, levels=[threshold, grid_qoi.max()], colors=["firebrick"], alpha=0.1) ax1.contour(X1, X2, grid_qoi, levels=[threshold], colors="firebrick", linewidths=2, linestyles="--") ax1.scatter(s_prior[0, ~fail_prior], s_prior[1, ~fail_prior], s=10, alpha=0.4, color="steelblue", edgecolors="none") ax1.scatter(s_prior[0, fail_prior], s_prior[1, fail_prior], s=30, alpha=0.9, color="firebrick", marker="D", edgecolors="white", linewidth=0.5) ax1.set_title(f"Prior $\\pi$ (failures: {int(fail_prior.sum())}/{M_is_vis})", fontsize=12) ax1.set_xlabel(r"$E_1$") ax1.set_ylabel(r"$E_2$") ax1.grid(True, alpha=0.2) # Right: IS samples with weight-scaled markers ax2.contourf(X1, X2, grid_qoi, levels=[threshold, grid_qoi.max()], colors=["firebrick"], alpha=0.1) ax2.contour(X1, X2, grid_qoi, levels=[threshold], colors="firebrick", linewidths=2, linestyles="--") ax2.scatter(s_is[0, ~fail_is], s_is[1, ~fail_is], s=w_sizes[~fail_is], alpha=0.4, color="darkorange", edgecolors="none") ax2.scatter(s_is[0, fail_is], s_is[1, fail_is], s=w_sizes[fail_is], alpha=0.9, color="firebrick", marker="D", edgecolors="white", linewidth=0.5) ax2.set_title(f"Biasing $q$ (failures: {int(fail_is.sum())}/{M_is_vis})", fontsize=12) ax2.set_xlabel(r"$E_1$") ax2.set_ylabel(r"$E_2$") ax2.grid(True, alpha=0.2) plt.tight_layout() plt.show() ``` ## IS in Action: Variance Reduction We now run the same 500-replicate experiment as before, but using importance sampling with $M = 500$ samples per replicate. ```{python} #| echo: false def mc_failure_estimates(variable, model, threshold, M, n_reps, bkd): """Run n_reps MC experiments, each with M samples.""" estimates = np.empty(n_reps) for i in range(n_reps): np.random.seed(i) s = variable.rvs(M) q = bkd.to_numpy(model(s)[0]) estimates[i] = np.mean(q > threshold) return estimates estimates_500 = mc_failure_estimates(variable, model, threshold, 500, n_reps, bkd) estimates_5000 = mc_failure_estimates(variable, model, threshold, 5000, n_reps, bkd) ``` ```{python} def is_failure_estimates(model, variable, threshold, M, n_reps, shift_center, bounds, spread_factor, bkd): """Run n_reps IS experiments, each with M samples.""" estimates = np.empty(n_reps) for i in range(n_reps): rng = np.random.RandomState(i + 100000) s = sample_q(M, shift_center, bounds, spread_factor, rng) # Evaluate model using bkd arrays s_bkd = bkd.asarray(s) q_vals = bkd.to_numpy(model(s_bkd)[0]) indicator = (q_vals > threshold).astype(float) # Importance weights: pi(theta) / q(theta) log_w = log_prior_pdf(s, variable, bkd) - log_q_pdf( s, shift_center, bounds, spread_factor ) w = np.exp(log_w) estimates[i] = np.mean(indicator * w) return estimates is_estimates_500 = is_failure_estimates( model, variable, threshold, 500, n_reps, shift_center, prior_bounds, is_spread_factor, bkd, ) ``` @fig-is-vs-mc-histogram places the MC and IS histograms side by side. ```{python} #| echo: false #| fig-cap: "Failure probability estimates from 500 experiments with $M = 500$ samples each. Left: standard MC --- wide spread with many zero estimates. Right: importance sampling with a shifted truncated normal --- dramatically narrower, no zeros, centered on the true probability." #| label: fig-is-vs-mc-histogram fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4.5)) # Shared x-limits so the width difference is visually obvious all_ests = np.concatenate([estimates_500, is_estimates_500]) xlim = (min(0, all_ests.min()) - 0.001, all_ests.max() + 0.001) # MC panel (reuse estimates_500 from earlier) ax1.hist(estimates_500, bins=40, density=True, alpha=0.7, color="steelblue", edgecolor="white", linewidth=0.5) ax1.axvline(true_p, color="firebrick", linestyle="--", linewidth=2, label=f"True $p$") n_zeros_mc = np.sum(estimates_500 == 0) ax1.set_title(f"MC, $M = 500$", fontsize=12) ax1.set_xlabel(r"$\hat{p}$") ax1.set_ylabel("Density") ax1.set_xlim(xlim) ax1.legend(fontsize=9) ax1.text(0.97, 0.80, f"std = {np.std(estimates_500):.2e}\n" f"zeros: {n_zeros_mc}/{n_reps}", transform=ax1.transAxes, ha="right", fontsize=9, bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.85)) # IS panel ax2.hist(is_estimates_500, bins=40, density=True, alpha=0.7, color="seagreen", edgecolor="white", linewidth=0.5) ax2.axvline(true_p, color="firebrick", linestyle="--", linewidth=2, label=f"True $p$") ax2.set_title(f"Importance Sampling, $M = 500$", fontsize=12) ax2.set_xlabel(r"$\hat{p}^{\mathrm{IS}}$") ax2.set_xlim(xlim) ax2.legend(fontsize=9) ax2.text(0.97, 0.80, f"std = {np.std(is_estimates_500):.2e}\n" f"zeros: {np.sum(is_estimates_500 == 0)}/{n_reps}", transform=ax2.transAxes, ha="right", fontsize=9, bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.85)) plt.tight_layout() plt.show() ``` The IS histogram is dramatically narrower and has no zeros. The variance reduction factor quantifies the improvement: IS with $M = 500$ is as precise as MC with many thousands of samples. ## Convergence Comparison @fig-is-convergence shows the relative RMSE of both estimators as a function of $M$. Both converge at rate $O(M^{-1/2})$ --- importance sampling does not change the rate. It reduces the **constant**, which on a log-log plot appears as a vertical shift between two parallel lines. ```{python} #| echo: false #| fig-cap: "Relative RMSE of the failure probability estimate vs. sample size. Both MC (blue) and IS (green) converge at $O(M^{-1/2})$ (parallel lines on the log-log plot), but the IS constant is orders of magnitude smaller. IS with $M = 100$ can achieve the same accuracy as MC with $M \\approx 50{,}000$." #| label: fig-is-convergence M_conv_values = [50, 100, 200, 500, 1000, 2000, 5000] n_reps_conv = 300 rrmse_mc = [] rrmse_is = [] for M in M_conv_values: # MC mc_ests = mc_failure_estimates(variable, model, threshold, M, n_reps_conv, bkd) rrmse_mc.append(np.sqrt(np.mean((mc_ests - true_p)**2)) / true_p) # IS is_ests = is_failure_estimates( model, variable, threshold, M, n_reps_conv, shift_center, prior_bounds, is_spread_factor, bkd, ) rrmse_is.append(np.sqrt(np.mean((is_ests - true_p)**2)) / true_p) fig, ax = plt.subplots(figsize=(8, 5)) M_arr = np.array(M_conv_values, dtype=float) ax.loglog(M_arr, rrmse_mc, "o-", color="steelblue", label="MC", markersize=6, linewidth=1.5) ax.loglog(M_arr, rrmse_is, "D--", color="seagreen", label="Importance Sampling", markersize=6, linewidth=1.5) # Reference slope c_ref = rrmse_mc[0] * M_arr[0]**0.5 ax.loglog(M_arr, c_ref * M_arr**(-0.5), ":", color="gray", linewidth=1, label=r"$O(M^{-1/2})$") ax.set_xlabel("Number of samples $M$", fontsize=12) ax.set_ylabel("Relative RMSE", fontsize=12) ax.set_title("Convergence: MC vs. Importance Sampling", fontsize=13) ax.legend(fontsize=10) ax.grid(True, which="both", alpha=0.2) plt.tight_layout() plt.show() ``` ## The Danger of Bad IS: Weight Degeneracy Importance sampling is not a free lunch. If $q$ is poorly chosen --- shifted too far into the tail, or made too narrow --- a small number of samples can acquire enormous importance weights while the rest contribute almost nothing. This is **weight degeneracy**: the effective sample size collapses even though $M$ is large. The **effective sample size** diagnoses this: $$ \text{ESS} = \frac{\bigl(\sum_{i=1}^{M} w_i\bigr)^2}{\sum_{i=1}^{M} w_i^2} $$ When all weights are equal, $\text{ESS} = M$. When one weight dominates, $\text{ESS} \approx 1$. A ratio $\text{ESS}/M \ll 1$ signals that the IS estimate is unreliable. @fig-weight-degeneracy compares the distribution of log-importance-weights for three choices of the biasing distribution: a well-chosen shift, an overshifted $q$, and an overly narrow $q$. ```{python} #| echo: false #| fig-cap: "Histograms of log importance weights $\\log w_i$ for three biasing distributions, each with $M = 500$. Left: well-chosen $q$ near the failure boundary --- weights are moderate, ESS is high. Center: $q$ shifted too far past the failure region --- a few extreme weights dominate, ESS collapses. Right: $q$ with too-small spread --- similar degeneracy. The effective sample size (ESS) is shown in each panel." #| label: fig-weight-degeneracy # Three q choices: good, overshifted, too narrow shift_configs = [ ("Well-chosen $q$", shift_center, is_spread_factor), ] # Overshifted: push the center further into the tail prior_mean = np.array([ 0.5 * (lo + hi) for lo, hi in prior_bounds ]) overshifted_center = shift_center + 1.5 * (shift_center - prior_mean) shift_configs.append(("Overshifted $q$", overshifted_center, is_spread_factor)) # Too narrow: same center but very small spread shift_configs.append(("Too-narrow $q$", shift_center, is_spread_factor * 0.3)) M_deg = 500 fig, axes = plt.subplots(1, 3, figsize=(15, 4)) colors = ["seagreen", "darkorange", "firebrick"] for ax, (label, center, spread), color in zip(axes, shift_configs, colors): rng = np.random.RandomState(42) s = sample_q(M_deg, center, prior_bounds, spread, rng) q_vals = bkd.to_numpy(model(bkd.asarray(s))[0]) log_w = log_prior_pdf(s, variable, bkd) - log_q_pdf( s, center, prior_bounds, spread ) w = np.exp(log_w) ess = (np.sum(w))**2 / np.sum(w**2) ax.hist(log_w, bins=40, density=True, alpha=0.7, color=color, edgecolor="white", linewidth=0.5) ax.set_title(label, fontsize=12) ax.set_xlabel(r"$\log\, w_i$") if ax == axes[0]: ax.set_ylabel("Density") ax.text(0.97, 0.85, f"ESS = {ess:.0f} / {M_deg}\n({100 * ess / M_deg:.0f}%)", transform=ax.transAxes, ha="right", fontsize=10, bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.85)) plt.tight_layout() plt.show() ``` The well-chosen $q$ produces log-weights that are concentrated in a narrow range --- the effective sample size is a healthy fraction of $M$. The overshifted and too-narrow distributions produce log-weight histograms with heavy tails, indicating that a handful of extreme weights dominate the estimator. ## Finding the Failure Region in Practice The IS estimator above assumed we already knew approximately where failures occur. For the 2D beam model, we could read this from the response surface plot. In realistic problems with expensive models and high-dimensional input spaces, this is not straightforward. ::: {.callout-note} ## Looking ahead: surrogates and optimization A common practical workflow is to build a cheap **surrogate model** (e.g., a polynomial chaos expansion or Gaussian process emulator) from a modest pilot sample, and then use the surrogate to identify the failure region. Specifically: 1. **Optimization on the surrogate.** Solve $\min_{\boldsymbol{\theta}} -f_{\text{surr}}(\boldsymbol{\theta})$ subject to $\boldsymbol{\theta}$ being in the prior's support, or equivalently find the **most probable failure point** (MPP) by minimizing the distance from the prior mode subject to $f_{\text{surr}}(\boldsymbol{\theta}) \ge c$. This is a standard reliability analysis technique (FORM/SORM) and locates the region of input space where failures are most likely under the prior. 2. **Hessian-based proposal covariance.** The Hessian of $\log \pi(\boldsymbol{\theta})$ (or of the limit-state function $g = f - c$) at the MPP provides curvature information about the failure boundary. Its inverse gives a natural covariance matrix for the biasing distribution $q$ --- a multivariate normal centered at the MPP with covariance proportional to the inverse Hessian. This adapts the shape and orientation of $q$ to the local geometry of the failure surface, without requiring manual tuning. 3. **Iterative refinement.** Run IS with the surrogate-based $q$, use the resulting weighted samples to update the surrogate in the failure region, and re-optimize. This focuses surrogate accuracy where it matters most for the failure probability estimate. These techniques make IS practical for expensive, high-dimensional models where visual inspection of the failure region is impossible. They are the subject of a later tutorial. ::: ## Exercises 1. Vary the threshold $c$ to produce true failure probabilities ranging from $10^{-2}$ to $10^{-4}$. How does the IS variance reduction factor (ratio of MC variance to IS variance at fixed $M$) change as $p$ decreases? Does IS become *more* or *less* valuable for rarer events? 2. For the beam model, use a pilot MC run of $M = 200$ to estimate the center of the failure region (mean of the failing samples). Use this as `shift_center` for IS. Compare the resulting variance reduction to using an arbitrary shift. How sensitive is the result to the accuracy of the center estimate? 3. Implement the effective sample size $\text{ESS} = (\sum w_i)^2 / \sum w_i^2$ and monitor it as you vary the `is_spread_factor` from 0.05 to 0.5. Plot $\text{ESS}/M$ vs. spread factor. Where does ESS collapse? What is the practical "sweet spot"? 4. **(Challenge)** Combine importance sampling with QMC: use a Sobol sequence in $[0,1]^2$ and transform it through $q$'s inverse CDF (instead of the prior's). Compare the variance of this QMC-IS estimator to both plain IS and plain MC. Does the QMC structure help even though the integrand contains a discontinuous indicator? ## Key Takeaways - MC needs $M \approx 100/p$ samples for 10% relative error on a failure probability $p$ --- prohibitive for rare events - QMC cannot help much: the indicator integrand is discontinuous, so the Koksma--Hlawka advantage disappears - Importance sampling reweights samples from a biasing distribution $q$ that concentrates mass in the failure region, correcting via the likelihood ratio $\pi/q$ - IS does not change the convergence **rate** ($1/\sqrt{M}$) but can reduce the variance **constant** by orders of magnitude - Poor choice of $q$ leads to **weight degeneracy** --- a few extreme weights dominate and the estimate becomes unreliable; the effective sample size ESS diagnoses this - In practice, surrogates and optimization (FORM/SORM, Hessian-based proposals) provide systematic ways to construct good biasing distributions for expensive, high-dimensional models ## Next Steps Continue with: - [Control Variates](control_variate_concept.qmd) --- Using correlated auxiliary quantities to reduce estimator variance