What Makes a Good Surrogate?

PyApprox Tutorial Library

How the choice of ansatz, training data, and fitting strategy affects surrogate quality, and how different surrogates exploit different kinds of function structure.

Learning Objectives

After completing this tutorial, you will be able to:

  • Explain why the surrogate’s basis must match the function’s structure
  • Describe what kinds of structure different surrogate methods exploit: polynomial smoothness, anisotropy, low rank, and kernel regularity
  • Explain why sample placement affects surrogate quality and how quasi-Monte Carlo improves on random sampling
  • Describe the curse of dimensionality and why structure exploitation is essential in high dimensions

The Ansatz Must Match the Function

In The Surrogate Modeling Workflow we saw that the ansatz — the functional form of the surrogate — constrains what it can represent. This is the single most consequential decision in the workflow. A poor ansatz cannot be rescued by more data or a better fitting algorithm.

The essential question is: does the basis contain functions that look like the target?

Smooth Functions: Global Polynomials Work Well

Consider a smooth function like \(f(x) = \sin(2\pi x)\,e^{-x^2/2}\). A global polynomial basis can approximate it to high accuracy because smooth functions have rapidly decaying polynomial expansion coefficients. As the degree increases, the surrogate converges quickly.
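This convergence is easy to check numerically. The sketch below (plain NumPy, not PyApprox) fits least-squares polynomials of increasing degree in a Chebyshev basis and reports the maximum error on a dense test grid:

```python
import numpy as np

def f(x):
    # smooth target: sin(2*pi*x) * exp(-x^2 / 2)
    return np.sin(2 * np.pi * x) * np.exp(-x**2 / 2)

x_train = np.linspace(-1, 1, 200)   # dense training grid on [-1, 1]
x_test = np.linspace(-1, 1, 1000)

errors = []
for degree in (4, 8, 16):
    # least-squares fit in the Chebyshev basis (numerically stable)
    cheb = np.polynomial.Chebyshev.fit(x_train, f(x_train), degree)
    errors.append(np.max(np.abs(cheb(x_test) - f(x_test))))
    print(f"degree {degree:2d}: max error {errors[-1]:.2e}")
```

The errors drop by orders of magnitude as the degree grows, reflecting the rapid coefficient decay of smooth functions.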

Discontinuous Functions: Global Polynomials Fail

Now consider a function with a jump discontinuity. A global polynomial is infinitely differentiable everywhere — it is structurally incapable of representing a sharp jump. No matter how many terms we add, the polynomial will overshoot and oscillate near the discontinuity. This is the Gibbs phenomenon, and it is a fundamental limitation of smooth bases applied to non-smooth functions.

A piecewise polynomial basis, by contrast, can place a breakpoint at (or near) the discontinuity and represent each smooth piece independently.

Figure 1: Top row: a smooth function approximated by a global polynomial (left) and a piecewise polynomial (right). Both work well, but the global polynomial is more efficient — it needs fewer terms. Bottom row: a discontinuous function. The global polynomial oscillates wildly near the jump (Gibbs phenomenon), while the piecewise polynomial handles it cleanly.

The lesson is general: every basis encodes assumptions about the target function. When those assumptions are met, convergence is fast. When they are violated, no amount of data or optimisation will compensate. The rest of this tutorial surveys how different surrogate methods encode different structural assumptions.
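The contrast is easy to reproduce. The NumPy sketch below fits a degree-30 global polynomial and a simple two-piece surrogate to a step function; placing the breakpoint exactly at the jump is an idealisation (in practice the jump location must be detected):

```python
import numpy as np

def step(x):
    return np.where(x < 0.0, -1.0, 1.0)   # jump discontinuity at x = 0

x_train = np.linspace(-1, 1, 401)
x_test = np.linspace(-1, 1, 2001)

# global polynomial: high degree, still overshoots near the jump
cheb = np.polynomial.Chebyshev.fit(x_train, step(x_train), 30)
global_err = np.max(np.abs(cheb(x_test) - step(x_test)))

# piecewise: fit each smooth piece separately, breakpoint at the jump
mask = x_train < 0
left = np.polynomial.Chebyshev.fit(x_train[mask], step(x_train[mask]), 2)
right = np.polynomial.Chebyshev.fit(x_train[~mask], step(x_train[~mask]), 2)
piecewise = np.where(x_test < 0, left(x_test), right(x_test))
piecewise_err = np.max(np.abs(piecewise - step(x_test)))

print(f"global degree-30 max error:  {global_err:.3f}")
print(f"piecewise max error:         {piecewise_err:.2e}")
```

The global polynomial's error stays O(1) near the jump no matter the degree, while the piecewise fit is exact to machine precision on each constant piece.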

Exploiting Structure in Multivariate Functions

In one dimension, choosing between a global and a piecewise polynomial already matters. In higher dimensions the stakes are much greater, because the curse of dimensionality (discussed below) makes it impossible to approximate a generic \(d\)-dimensional function without exploiting some kind of structure.

Different surrogate methods target different structural properties of \(f\). Understanding what structure each method exploits is the key to choosing the right tool.

Polynomial Chaos Expansions: Smoothness and the Input Measure

A polynomial chaos expansion (PCE) writes the model output as a sum of multivariate polynomials:

\[ \hat{f}(\mathbf{x}) = \sum_{j=0}^{P-1} c_j \, \Psi_j(\mathbf{x}) \]

where each \(\Psi_j\) is a product of univariate polynomials chosen to be orthogonal with respect to the input probability measure. Gaussian inputs get Hermite polynomials, uniform inputs get Legendre polynomials, and so on.

The structure that PCE exploits is smoothness: if \(f\) is smooth, its polynomial expansion coefficients decay rapidly, so a moderate number of terms gives a good approximation. The connection to the input measure is a bonus — orthogonality means the expansion coefficients directly encode statistical moments, making downstream UQ essentially free once the surrogate is fitted.
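To illustrate why orthogonality makes downstream UQ essentially free, the NumPy sketch below fits a small Legendre expansion for a uniform input on \([-1, 1]\) and reads the mean and variance directly off the coefficients, using \(\mathbb{E}[P_j P_k] = \delta_{jk}/(2j+1)\) under the uniform measure (PyApprox's own PCE tooling is covered in the linked tutorial):

```python
import numpy as np
from numpy.polynomial import legendre

def f(x):
    return np.exp(0.5 * x)   # smooth target, uniform input on [-1, 1]

x = np.linspace(-1, 1, 500)
coef = legendre.legfit(x, f(x), 6)   # least-squares Legendre coefficients

# mean is c_0; variance is sum_{j>=1} c_j^2 / (2j + 1)
mean = coef[0]
var = np.sum(coef[1:] ** 2 / (2 * np.arange(1, coef.size) + 1))

# sanity check against brute-force Monte Carlo
u = np.random.default_rng(0).uniform(-1, 1, 200_000)
print(f"mean: {mean:.5f} (MC estimate {np.mean(f(u)):.5f})")
print(f"var:  {var:.5f} (MC estimate {np.var(f(u)):.5f})")
```

No extra model evaluations are needed to get the moments: they fall out of the coefficients once the surrogate is fitted.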

Figure 2: PCE coefficients for a smooth function (left) decay rapidly — a few terms capture most of the variance. For a rough function (right) the coefficients decay slowly, requiring many more terms for the same accuracy.

PCEs work best when the function is smooth and the number of important input variables is moderate. For details on building and fitting a PCE, see Building a Polynomial Chaos Surrogate.

Sparse Grids: Exploiting Anisotropy

Many multivariate functions are anisotropic: some input variables have a much stronger effect on the output than others. A full tensor-product polynomial basis allocates the same resolution to every variable, wasting effort on directions that contribute little.

Sparse grids address this by building an approximation from a carefully selected subset of tensor-product terms. The idea is rooted in the Smolyak construction: instead of requiring high resolution in all \(d\) dimensions simultaneously, we combine many lower-dimensional approximations that each resolve one or two directions well.

Figure 3: Left: a full tensor-product grid in 2D with 9 points per dimension (81 total). Centre: a Smolyak sparse grid at comparable accuracy (32 points). Right: an anisotropic sparse grid that allocates more resolution to the important variable x₁ (26 points). Sparse grids dramatically reduce the number of required evaluations by avoiding the corners of the full grid.

The key insight is that sparse grids allocate resolution where it matters. If \(x_1\) drives most of the output variability and \(x_2\) has only a mild effect, an anisotropic sparse grid puts many more points along \(x_1\) and only a coarse resolution along \(x_2\). This can reduce the required number of evaluations from \(n^d\) (full grid) to something closer to \(n (\log n)^{d-1}\) for isotropic sparse grids, with further savings when anisotropy is exploited.
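A back-of-the-envelope count makes the savings concrete. The sketch below uses the size of a total-degree index set as a stand-in for the Smolyak selection (the exact point count depends on the underlying 1D rules, so treat these as indicative only):

```python
from math import comb

def full_grid_count(n, d):
    # full tensor product: n points in each of d directions
    return n ** d

def smolyak_count(level, d):
    # multi-indices with i_1 + ... + i_d <= level (each i_k >= 0):
    # a total-degree set of size C(level + d, d)
    return comb(level + d, d)

for d in (2, 5, 10):
    print(f"d = {d:2d}: full {full_grid_count(9, d):,} "
          f"vs sparse-style {smolyak_count(8, d):,}")
```

The full grid explodes exponentially while the total-degree count grows far more slowly, which is exactly the gap Figure 3 visualises in 2D.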

Function Trains: Exploiting Low-Rank Structure

Some multivariate functions have limited interaction between their input variables. A function is rank-1 (separable) if it can be written as a product of univariate functions:

\[ f(x_1, x_2) = g_1(x_1) \cdot g_2(x_2) \]

A rank-\(r\) function is a sum of \(r\) such products. Many functions arising in engineering have low effective rank — the variables interact, but in a limited way that can be captured by a modest number of product terms.

The function train (FT) decomposition represents a \(d\)-dimensional function as a chain of matrix multiplications:

\[ f(x_1, \ldots, x_d) = \mathbf{G}_1(x_1) \cdot \mathbf{G}_2(x_2) \cdots \mathbf{G}_d(x_d) \]

where each core \(\mathbf{G}_k(x_k)\) is a matrix whose entries are univariate functions. The shared matrix dimensions — the ranks — control how much interaction between variables the FT can represent.

Figure 4: A rank-1 function (left) is a product of two 1D functions — its contours are perfectly aligned with the coordinate axes. A rank-2 function (centre) is a sum of two such products, enabling richer structure like multiple peaks. A generic function (right) may require many terms, but low-rank approximations often capture most of the variability with surprisingly small rank.

The computational advantage is dramatic: a full polynomial expansion in \(d\) variables at degree \(p\) has \(\binom{d+p}{d}\) terms, which grows rapidly with both \(d\) and \(p\). A rank-\(r\) function train has \(\mathcal{O}(d \cdot r^2 \cdot p)\) parameters — linear in \(d\) for fixed rank. This makes FTs one of the few surrogate methods that scale to truly high-dimensional problems, provided the function has low effective rank.
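Low rank is easy to probe numerically in 2D: sample the function on a grid and inspect the singular values of the resulting matrix. A sketch with NumPy, using two hypothetical test functions:

```python
import numpy as np

x1 = np.linspace(-1, 1, 100)
x2 = np.linspace(-1, 1, 100)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

F_sep = np.sin(X1) * np.exp(X2)   # separable: rank 1 by construction
F_mix = np.exp(0.3 * X1 * X2)     # not separable, but interactions are weak

def numerical_rank(F, tol=1e-10):
    # number of singular values above a relative threshold
    s = np.linalg.svd(F, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rank_sep = numerical_rank(F_sep)
rank_mix = numerical_rank(F_mix)
print(f"separable function rank: {rank_sep}")
print(f"mixed function rank:     {rank_mix}")
```

The separable function is exactly rank 1, and even the non-separable function has a small numerical rank because its singular values decay rapidly. This is the structure a function train captures with its core ranks.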

For a detailed construction of function trains, see An Introduction to Function Trains.

Gaussian Processes: Kernel-Encoded Smoothness and Uncertainty

A Gaussian process (GP) is a probability distribution over functions, defined by a mean function and a kernel \(k(\mathbf{x}, \mathbf{x}')\) that encodes assumptions about the function’s smoothness and correlation structure.

The structure a GP exploits is entirely determined by its kernel:

  • A Squared Exponential kernel assumes the function is infinitely smooth
  • A Matérn 5/2 kernel assumes twice-differentiable functions
  • A Matérn 3/2 kernel allows once-differentiable functions with small-scale roughness
  • Anisotropic length scales (one per input dimension) let the GP learn which variables the function varies along most rapidly

The defining advantage of a GP over other surrogates is that it provides built-in uncertainty quantification. At every prediction point, the GP returns not just a best estimate \(\mu^*(\mathbf{x})\) but also a posterior standard deviation \(\sigma^*(\mathbf{x})\) that reflects how far the point is from the training data.

Figure 5: A GP surrogate (blue line) with posterior uncertainty (shaded ±2σ band). Near training points (black dots) the band narrows to near zero — the GP is confident. Between observations the band widens, honestly reporting where the surrogate is uncertain. This built-in uncertainty is the GP’s defining feature.

This uncertainty information is valuable beyond accuracy assessment. It enables adaptive sampling strategies that place new training points where the GP is most uncertain, systematically filling gaps in data coverage. For details, see Building a Gaussian Process Surrogate.
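The posterior formulas are short enough to sketch from scratch. The NumPy example below conditions a zero-mean GP with a Matérn 5/2 kernel (hypothetical length scale 0.3, noise-free observations with a small jitter) on a few training points and reports the posterior standard deviation:

```python
import numpy as np

def matern52(x, y, ell=0.3):
    # Matern 5/2 kernel: twice-differentiable prior samples
    r = np.abs(x[:, None] - y[None, :]) / ell
    return (1.0 + np.sqrt(5) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5) * r)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 8))   # sparse, irregular observations
y_train = np.sin(3 * x_train)
x_test = np.linspace(-1, 1, 200)

K = matern52(x_train, x_train) + 1e-10 * np.eye(x_train.size)  # jitter
K_star = matern52(x_test, x_train)

mu = K_star @ np.linalg.solve(K, y_train)                       # posterior mean
cov = matern52(x_test, x_test) - K_star @ np.linalg.solve(K, K_star.T)
sigma = np.sqrt(np.clip(np.diag(cov), 0.0, None))               # posterior std

# sigma collapses near observations and widens in the gaps
print(f"sigma range: [{sigma.min():.3f}, {sigma.max():.3f}]")
```

The ratio between the largest and smallest posterior standard deviations shows exactly the behaviour in Figure 5: confidence near the data, honest uncertainty in the gaps.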

Neural Networks: Compositional Structure

Neural network surrogates learn a hierarchy of nonlinear transformations:

\[ \hat{f}(\mathbf{x}) = W_L \circ \sigma \circ W_{L-1} \circ \cdots \circ \sigma \circ W_1(\mathbf{x}) \]

where \(W_\ell\) are affine maps and \(\sigma\) is a nonlinear activation. The structure they exploit is compositional: many physical phenomena can be described as a sequence of simpler transformations applied one after another. Neural networks learn these intermediate representations from data.
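The composition is just alternating affine maps and activations. A minimal NumPy forward pass with untrained random weights, illustrating the structure only (a real surrogate would fit these weights to data):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

# hypothetical, untrained weights: 2 inputs -> 16 -> 16 -> 1 output
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(size=(1, 16)), np.zeros(1)

def f_hat(x):
    h1 = relu(W1 @ x + b1)   # first nonlinear transformation
    h2 = relu(W2 @ h1 + b2)  # second, composed on top of the first
    return W3 @ h2 + b3      # affine read-out

print(f_hat(np.array([0.5, -0.3])))
```

Each hidden layer builds an intermediate representation on top of the previous one; training adjusts the weights so that these representations become useful for predicting the output.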

Their flexibility is both a strength and a weakness. Neural networks impose fewer structural assumptions than polynomials or GPs, making them applicable to a broad class of functions. But this flexibility means they typically require more training data and provide no built-in uncertainty quantification. They also lack the direct connection to the input probability measure that makes PCE-based UQ analytically tractable.

Neural network surrogates are most compelling when the input dimension is very large, the data is abundant, or the function has compositional structure that other methods cannot exploit.

Summary: Which Structure Does Each Method Exploit?

| Method | Structure exploited | Strengths | Limitations |
|---|---|---|---|
| PCE | Smoothness, input measure | Analytical UQ, spectral convergence | Curse of dimensionality, assumes smoothness |
| Sparse grids | Anisotropy, variable importance | Efficient for anisotropic functions | Still polynomial-based, moderate \(d\) |
| Function trains | Low-rank variable interactions | Linear scaling in \(d\) | Requires low effective rank |
| GPs | Kernel smoothness | Built-in uncertainty, adaptive sampling | \(\mathcal{O}(N^3)\) fitting cost |
| Neural networks | Compositional hierarchy | Flexible, handles large \(d\) | Data-hungry, no analytical UQ |

No single method dominates. The right choice depends on the function’s structure, the input dimension, the computational budget, and the downstream task.

Training Data Design Matters

The second decision in the surrogate workflow is where to sample. Even with the right ansatz, poor sample placement can produce a poor surrogate.

Random Sampling Leaves Gaps

Random Monte Carlo sampling is simple: draw \(N\) points independently from the input distribution. But random points tend to cluster in some regions and leave gaps in others. In 2D the effect is visible; in higher dimensions it becomes severe.

Quasi-Monte Carlo: Better Coverage

Quasi-Monte Carlo (QMC) sequences — such as Sobol, Halton, or lattice rules — are deterministic point sets designed to fill the domain more evenly than random samples. They are constructed to have low discrepancy, a measure of how unevenly the points cover the space.

Figure 6: 256 sample points in 2D drawn by Monte Carlo (left) and a Sobol quasi-Monte Carlo sequence (right). The MC samples cluster and leave visible gaps. The Sobol sequence covers the domain much more uniformly, which translates directly to better surrogate accuracy for the same number of evaluations.
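With SciPy's scipy.stats.qmc module the comparison takes a few lines; a lower discrepancy value indicates more even coverage:

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)
mc = rng.uniform(size=(256, 2))   # independent random points

# scrambled Sobol sequence: 2^8 = 256 points
sobol = qmc.Sobol(d=2, scramble=True, seed=0).random_base2(m=8)

d_mc = qmc.discrepancy(mc)        # centered L2 discrepancy (lower is better)
d_sobol = qmc.discrepancy(sobol)
print(f"MC discrepancy:    {d_mc:.2e}")
print(f"Sobol discrepancy: {d_sobol:.2e}")
```

The Sobol set's discrepancy is substantially lower than the Monte Carlo set's at the same sample count, quantifying the visual difference in Figure 6.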

Better Coverage → Better Surrogates

The improved space-filling of QMC translates directly to better surrogate accuracy for the same \(N\). The following experiment fits the same polynomial surrogate to 1D training data drawn by MC and QMC, repeated over many random seeds to show the variability.

Figure 7: Surrogate error vs. number of training points for Monte Carlo (blue) and Sobol QMC (orange) sampling. Bands show the 10th–90th percentile range over 50 repetitions (QMC uses different scrambles). QMC consistently achieves lower error with less variability, especially at small N where every sample point matters most.

The practical message: before spending the budget on more evaluations, consider whether the same budget spent on better-placed evaluations would help more. This is especially important when each evaluation is expensive and the budget is small.

The Curse of Dimensionality

Everything above becomes dramatically harder in high dimensions. To understand why, consider the simplest approach to approximating a \(d\)-dimensional function: evaluate it on a grid.

With \(n\) points per dimension, the total number of grid points is \(n^d\). Even a modest \(n = 10\) becomes completely infeasible in moderate dimensions:

Figure 8: The curse of dimensionality: grid points required for n = 10 points per dimension. At d = 5 we already need 100,000 evaluations. By d = 10, the count exceeds any computational budget. This exponential scaling is why surrogate methods must exploit function structure — smoothness, anisotropy, low rank — to break the curse.
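The growth is trivial to tabulate:

```python
n = 10  # points per dimension
counts = {d: n**d for d in (1, 2, 3, 5, 10)}
for d, c in counts.items():
    print(f"d = {d:2d}: {c:>14,} grid points")
```

At ten dimensions the count is ten billion, before a single model evaluation has even been considered.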

This is why exploiting structure is not optional in high dimensions — it is the only way forward:

  • PCEs with total-degree index sets grow as \(\binom{d+p}{d}\) instead of \((p+1)^d\); this removes the exponential dependence on \(d\) for fixed \(p\), but the count still grows like \(d^p\). Sparse and low-rank PCE methods push this further.
  • Sparse grids reduce the grid count to \(\mathcal{O}(n (\log n)^{d-1})\), exploiting the fact that most grid points in a full tensor product contribute negligibly.
  • Function trains scale as \(\mathcal{O}(d \cdot r^2 \cdot p)\) — linear in \(d\) — by exploiting low-rank interaction structure.
  • GPs bypass the grid entirely by placing a continuous prior over functions, but their fitting cost \(\mathcal{O}(N^3)\) limits them to moderate training set sizes.
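Putting the scalings listed above side by side, with hypothetical values \(n = 10\), \(p = 5\), and \(r = 4\):

```python
from math import comb

n, p, r = 10, 5, 4   # hypothetical resolution, degree, and rank
rows = []
for d in (2, 5, 10, 20):
    full = n ** d               # full tensor grid
    pce = comb(d + p, d)        # total-degree PCE basis size
    ft = d * r**2 * p           # function-train parameter count
    rows.append((d, full, pce, ft))
    print(f"d = {d:2d}: full grid {full:.1e}, PCE terms {pce:,}, FT params {ft:,}")
```

At \(d = 20\) the full grid is astronomically large, the total-degree PCE basis is merely large, and the function-train parameter count remains in the low thousands.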

The right surrogate method is the one that matches the structure your function actually has, allowing it to break the exponential scaling barrier.

Key Takeaways

  • The ansatz must match the function’s structure. Global polynomials excel for smooth functions but fail for discontinuities; piecewise bases handle jumps but are less efficient for smooth targets. This mismatch is the most common source of poor surrogate performance.
  • Different surrogate methods exploit different structures: PCE exploits smoothness and the input measure, sparse grids exploit anisotropy, function trains exploit low-rank interactions, GPs encode smoothness via kernels and provide uncertainty, and neural networks exploit compositional hierarchy.
  • Sample placement affects surrogate quality as much as sample count. Quasi-Monte Carlo sequences provide better coverage than random sampling for the same budget.
  • The curse of dimensionality makes structure exploitation essential, not optional, in high dimensions. The surrogate method should be chosen based on what structural properties the target function is expected to have.

Next Steps