Model Ensemble Selection

PyApprox Tutorial Library

Why adding a poorly-correlated low-fidelity model can increase estimator variance relative to using fewer models, and how to automatically identify the best model subset from pilot data alone.

Learning Objectives

After completing this tutorial, you will be able to:

Explain why including a weakly-correlated low-fidelity model can increase estimator variance compared to using a smaller ensemble
Describe the ensemble selection problem: finding the subset of LF models $\mathcal{M}^* \subseteq \{1,\ldots,M\}$ that, together with the HF model $f_0$, minimises predicted variance at a given budget
Interpret a model correlation heatmap to visually identify candidate LF models
Sketch the enumeration strategy that PyApprox uses to solve the selection problem

Prerequisites

Complete PACV Concept or MLBLUE Concept before this tutorial. Ensemble selection is the final degree of freedom in the multi-fidelity design problem: once the estimator family and sample allocation strategy are chosen, you still need to decide which models to include.

The Surprising Cost of Useless Models

Adding a cheap LF model seems like it should always help: more data, lower cost, nothing to lose. This intuition is wrong. Consider a three-model ensemble $\{f_0, f_1, f_2\}$ where $f_1$ is only weakly correlated with $f_0$. The ACV estimator must estimate control variate coefficients and allocate budget across all models. When a model contributes little variance reduction, it absorbs samples that would have been better spent on the remaining models, pushing the estimator variance above that of the two-model ensemble $\{f_0, f_2\}$.

Figure 1: Predicted ACVMF variance (relative to single-fidelity MC) as a function of the HF–LF1 correlation $\rho_{01}$, controlled by the parameter $\theta_1$ of the tunable benchmark. When $\rho_{01}$ is small, the three-model estimator (blue) is *worse* than the best two-model subset (orange dashed). The shaded region marks where using all three models hurts.

Figure 1 shows the crossover: when $\rho_{01}$ is small the three-model estimator is worse than the best two-model subset. As $\rho_{01}$ increases the three-model estimator eventually overtakes the two-model one.
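The same crossover can be reproduced by hand with a simpler estimator. The sketch below is not the ACVMF computation behind Figure 1. It assumes the closed-form optimal variance of the multifidelity Monte Carlo (MFMC) estimator of Peherstorfer, Willcox and Gunzburger as a stand-in, and the correlations and costs are made-up illustrative values. The function name `mfmc_variance_ratio` is hypothetical and not part of PyApprox.

```python
import math


def mfmc_variance_ratio(rhos, costs):
    """Optimally allocated MFMC variance divided by equal-budget MC variance.

    rhos[i]  : correlation of model i with the HF model (rhos[0] == 1)
    costs[i] : cost per evaluation of model i (costs[0] is the HF cost)
    Models must be listed in order of strictly decreasing |correlation|.
    """
    r2 = [r * r for r in rhos] + [0.0]
    if any(r2[i] <= r2[i + 1] for i in range(len(rhos))):
        raise ValueError("models must be ordered by strictly decreasing |correlation|")
    # The closed form only applies when the cost ratios satisfy this condition.
    for i in range(1, len(rhos)):
        if costs[i - 1] / costs[i] <= (r2[i - 1] - r2[i]) / (r2[i] - r2[i + 1]):
            raise ValueError("cost ratios violate the MFMC ordering condition")
    total = sum(math.sqrt(w * (r2[i] - r2[i + 1])) for i, w in enumerate(costs))
    return total ** 2 / costs[0]


# Assumed values: one well-correlated LF model (0.9) plus an optional extra
# LF model that is ten times cheaper but less correlated with the HF model.
two_model = mfmc_variance_ratio([1.0, 0.9], [1.0, 0.01])
three_weak = mfmc_variance_ratio([1.0, 0.9, 0.3], [1.0, 0.01, 0.001])
three_strong = mfmc_variance_ratio([1.0, 0.9, 0.6], [1.0, 0.01, 0.001])

print(f"two models:             {two_model:.4f}")
print(f"three models, rho=0.3:  {three_weak:.4f}")
print(f"three models, rho=0.6:  {three_strong:.4f}")
```

With these assumed numbers the extra model raises the variance ratio when its correlation is 0.3 and lowers it when its correlation is 0.6, even though it costs the same in both cases.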

This is not a corner case. In practice, large model libraries routinely include models that are only weakly predictive of the HF output, and naive inclusion degrades performance.

The Ensemble Selection Problem

Given a library of $M$ candidate LF models, the ensemble selection problem is:

Find the subset $\mathcal{M}^* \subseteq \{1, \ldots, M\}$ of LF models (always including $f_0$) such that the estimator using $\{f_0\} \cup \mathcal{M}^*$ achieves the smallest predicted variance at budget $P$.

This is a combinatorial problem: there are $2^M$ possible subsets of LF models. For typical ensembles ($M \leq 10$, so at most 1024 subsets) exhaustive enumeration is tractable: each subset's sample allocation is optimised independently from the pilot covariance, with no further model evaluations.
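As a quick check on the size of the search space, the following sketch enumerates LF subsets with the standard library. The helper name `lf_subsets` and its `max_size` argument are hypothetical and only illustrate the idea of a size cap.

```python
from itertools import combinations


def lf_subsets(n_lf, max_size=None):
    """Yield every subset of LF model indices 1..n_lf with at most max_size members."""
    if max_size is None:
        max_size = n_lf
    for size in range(max_size + 1):
        yield from combinations(range(1, n_lf + 1), size)


all_subsets = list(lf_subsets(10))
capped = list(lf_subsets(10, max_size=3))
print(len(all_subsets), len(capped))  # 1024 176
```

The empty subset corresponds to single-fidelity MC with $f_0$ alone. Capping the subset size at 3 cuts the 1024 candidates to 176.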

The full procedure is:

1. Pilot study: evaluate all candidate models at $N_\text{pilot}$ shared samples to estimate the joint covariance matrix.
2. Enumerate subsets: for each subset of size $\leq k$ (a user-set cap), build the estimator and optimise its sample allocation at budget $P$.
3. Select: return the subset and estimator with the smallest optimised variance.

Steps 2–3 require no additional model evaluations beyond the pilot.
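The three steps can be sketched end to end with NumPy. This is not PyApprox's implementation: the models and costs are hypothetical, and the per-subset score is the closed-form MFMC variance ratio, used here as a stand-in for optimising an ACV estimator. Subsets for which that closed form does not apply are simply skipped, which is an assumption of this sketch.

```python
import math
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in models of a 2-D random input; index 0 is the HF model.
models = [
    lambda x: x[:, 0],
    lambda x: x[:, 0] + 0.3 * x[:, 1],
    lambda x: x[:, 0] + 1.0 * x[:, 1],
    lambda x: x[:, 0] + 3.0 * x[:, 1],
]
costs = np.array([1.0, 0.05, 0.01, 0.005])  # assumed cost per evaluation


def mfmc_ratio(rho, cost):
    """MFMC variance over equal-budget MC variance; rho[0], cost[0] are the HF model."""
    order = np.concatenate(([0], 1 + np.argsort(-np.abs(rho[1:]))))
    r2 = np.append(rho[order] ** 2, 0.0)
    w = cost[order]
    gaps = r2[:-1] - r2[1:]
    if np.any(gaps <= 0):
        return math.inf  # correlations not strictly decreasing
    if np.any(w[:-1] / w[1:] <= gaps[:-1] / gaps[1:]):
        return math.inf  # closed form not valid for these cost ratios
    return float(np.sum(np.sqrt(w * gaps)) ** 2 / w[0])


# Step 1: pilot study on shared samples.
n_pilot = 100
x = rng.standard_normal((n_pilot, 2))
values = np.column_stack([f(x) for f in models])
cov = np.cov(values, rowvar=False)
std = np.sqrt(np.diag(cov))
rho_hf = cov[0] / (std[0] * std)  # correlation of each model with f0

# Step 2: score every LF subset of size <= max_lf using pilot statistics only.
max_lf = 3
scores = {}
for size in range(max_lf + 1):
    for lf in combinations(range(1, len(models)), size):
        idx = [0, *lf]
        scores[tuple(idx)] = mfmc_ratio(rho_hf[idx], costs[idx])

# Step 3: select the subset with the smallest predicted variance ratio.
best = min(scores, key=scores.get)
print("best subset:", best, "variance ratio:", round(scores[best], 3))
```

Because the MFMC ratio has a closed form, step 2 needs no numerical optimiser and the ratio does not depend on the budget. An ACV estimator would instead run an allocation optimisation at budget $P$ for each subset.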

The Correlation Heatmap

The joint pilot covariance (or its normalised form, the correlation matrix) is the primary diagnostic for ensemble design. Figure 2 shows the correlation matrices for two different configurations of the tunable benchmark, demonstrating how the correlation structure controls which models are worth including.

Figure 2: Correlation matrices for two configurations of the tunable benchmark. Left ($\theta_1 = 1.4$): $f_1$ is strongly correlated with $f_0$ ($\rho_{01} = 0.96$) — both LF models help. Right ($\theta_1 = 0.6$): $f_1$ is weakly correlated ($\rho_{01} = 0.55$) — only $f_2$ contributes meaningful variance reduction.

From Figure 2, the left panel shows a configuration where both LF models are useful ($\rho_{01} = 0.96$, $\rho_{02} = 0.41$), while the right panel shows a case where only $f_2$ provides meaningful correlation. Whether to include one or both LF models cannot be answered from the heatmap alone — the optimal decision also depends on model costs and the available budget.
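The normalisation from covariance to correlation is a one-liner. The sketch below uses hypothetical pilot outputs (rows are samples, columns are models) rather than the tunable benchmark.

```python
import numpy as np


def correlation_from_covariance(cov):
    """Scale a covariance matrix by the standard deviations on its diagonal."""
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


rng = np.random.default_rng(1)
x = rng.standard_normal((200, 2))
# Hypothetical pilot outputs for f0, f1, f2 evaluated at the same 200 samples.
pilot_values = np.column_stack(
    [x[:, 0], x[:, 0] + 0.3 * x[:, 1], x[:, 0] + 1.5 * x[:, 1]]
)

cov = np.cov(pilot_values, rowvar=False)
corr = correlation_from_covariance(cov)
print(np.round(corr, 2))
```

The first row of `corr` holds the HF correlations that a heatmap highlights. Passing `corr` to `matplotlib.pyplot.imshow` with `vmin=-1, vmax=1` is one way to draw such a heatmap.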

How Many Models Is Enough?

Figure 3 shows the ACVMF variance ratio (relative to MC) for the best one-LF and two-LF subsets across a range of $\rho_{01}$ values. The winner depends on the correlation structure: at low $\rho_{01}$, using only $f_2$ is best; at high $\rho_{01}$, the full three-model ensemble wins.

Figure 3: ACVMF variance ratio for the best one-LF subset (orange) and the two-LF ensemble (blue) across different HF–LF1 correlation strengths. At low $\rho_{01}$ the single-LF subset $\{f_0, f_2\}$ outperforms the full ensemble; at high $\rho_{01}$ the three-model estimator is clearly better. The crossover identifies the correlation threshold below which $f_1$ should be excluded.

In Figure 3, at low $\rho_{01}$ the best strategy is to exclude $f_1$ and use only $\{f_0, f_2\}$. As $\rho_{01}$ increases, $f_1$ becomes valuable and the three-model ensemble dominates. The crossover point identifies the correlation threshold below which $f_1$ should be dropped from the ensemble.
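To see why such a threshold moves with model cost, consider a simplified stand-in for ACVMF: the multifidelity Monte Carlo (MFMC) estimator, whose optimally allocated variance has a closed form. This is an illustrative derivation under that assumption, not the computation behind Figure 3. The symbols $w_i$ (cost per evaluation of $f_i$) and $a$ are introduced here for the derivation only. Assume $0 < \rho_{01} < \rho_{02}$ and that the cost ratios satisfy the MFMC ordering conditions. Relative to equal-budget MC, the two variance ratios are

$$
R_{\{0,2\}} = \frac{1}{w_0}\left(\sqrt{w_0\,(1-\rho_{02}^2)} + \sqrt{w_2\,\rho_{02}^2}\right)^2,
\qquad
R_{\{0,1,2\}} = \frac{1}{w_0}\left(\sqrt{w_0\,(1-\rho_{02}^2)} + \sqrt{w_2\,(\rho_{02}^2-\rho_{01}^2)} + \sqrt{w_1\,\rho_{01}^2}\right)^2 .
$$

The first term is common to both, so $R_{\{0,1,2\}} < R_{\{0,2\}}$ exactly when

$$
\sqrt{\rho_{02}^2-\rho_{01}^2} < \rho_{02} - a\,\rho_{01},
\qquad a = \sqrt{w_1 / w_2}.
$$

Both sides are non-negative when $a < 1$, so squaring gives the threshold

$$
\rho_{01} > \rho_{01}^* = \frac{2a}{1+a^2}\,\rho_{02}.
$$

For example, with $w_1/w_2 = 0.1$ and $\rho_{02} = 0.9$ the threshold is $\rho_{01}^* \approx 0.52$. As $f_1$ becomes nearly free ($a \to 0$) the threshold tends to zero, and as its cost approaches that of $f_2$ ($a \to 1$) the threshold tends to $\rho_{02}$. In this idealised setting the threshold does not involve the budget; an ACV estimator with its own allocation constraints may behave differently.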

Why This Matters in Practice

Real model libraries are often assembled opportunistically: an HPC simulation suite might have a fine-mesh model, two coarse-mesh variants, a surrogate trained on a different parameter regime, and an empirical correlation. Not all of these will help. The enumeration over subsets itself is cheap (no additional model evaluations beyond the pilot), but the pilot cost grows with each model added to the library. Choosing the number of pilot samples is therefore itself a trade-off: budget spent on the pilot (exploration) improves the covariance estimates but leaves less for the final estimator (exploitation), and approaches that balance the two are needed.
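The sketch below gives a feel for the exploration side of that trade-off. It assumes jointly Gaussian outputs for two models with a true correlation of 0.55 (an illustrative value), repeats a synthetic pilot study many times, and reports how much the estimated correlation scatters. The function name `correlation_spread` is hypothetical.

```python
import numpy as np


def correlation_spread(rho, n_pilot, n_repeats=2000, seed=0):
    """Std. dev. of the sample correlation across repeated synthetic pilot studies."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # Shape (n_repeats, n_pilot, 2): each repeat is one pilot study of two models.
    samples = rng.multivariate_normal(np.zeros(2), cov, size=(n_repeats, n_pilot))
    a = samples[..., 0] - samples[..., 0].mean(axis=1, keepdims=True)
    b = samples[..., 1] - samples[..., 1].mean(axis=1, keepdims=True)
    r = (a * b).sum(axis=1) / np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1))
    return float(r.std())


for n_pilot in (10, 50, 250):
    spread = correlation_spread(0.55, n_pilot)
    print(f"N_pilot = {n_pilot:4d}: std of estimated correlation = {spread:.3f}")
```

With few pilot samples the estimated correlation is noisy enough to place a model on the wrong side of an inclusion threshold, while every extra pilot sample costs one evaluation of each model in the library.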

Key Takeaways

Adding a weakly-correlated LF model can increase estimator variance relative to using a smaller ensemble (see Figure 1)
The ensemble selection problem is: find the best subset of $\leq k$ LF models given the pilot covariance and a target budget
The pilot correlation heatmap (see Figure 2) is the primary visual diagnostic but is not sufficient alone — cost ratios and budget also drive the decision
The optimal ensemble size depends on the correlation structure: stronger correlations favour larger ensembles (see Figure 3)
Enumeration over subsets is cheap: no additional model evaluations are required beyond the pilot

Exercises

1. From Figure 1, at what approximate $\rho_{01}$ does the three-model estimator break even with the best two-model subset? How does this answer depend on the budget $P$?
2. From Figure 2, in each configuration, which single LF model would you include first, and why? Would your answer change if model costs were equal?
3. From Figure 3, at $\rho_{01} \approx 0.7$, the best one-LF and two-LF ensembles have similar variance. What practical consideration might break the tie in favour of fewer models?

Tip

Ready to try this? See API Cookbook → Ensemble Selection.

Next Steps

API Cookbook — AllSubsetsStrategy, ACVSearch, and plot_estimator_variance_reductions in PyApprox
