Bayesian Model Averaging
Bayesian model averaging accounts for uncertainty about which model is correct by combining the predictions of all candidate models, weighted by their posterior probabilities.
Definition
Bayesian model averaging forms predictions and inferences by taking a weighted average over a set of candidate models, with weights equal to the posterior probability of each model given the data, thereby incorporating model uncertainty into the final answer.
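The definition can be written out as a pair of equations. The notation here is an assumption made for illustration: D is the observed data, M_1, ..., M_K are the candidate models, and Δ is the quantity to be predicted or estimated.

```latex
% Averaged posterior (or predictive) distribution for the quantity \Delta
p(\Delta \mid D) = \sum_{k=1}^{K} p(\Delta \mid M_k, D)\, p(M_k \mid D)

% Posterior model probability used as the weight for model M_k
p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)},
\qquad
p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k
```

Here p(D | M_k) is the marginal likelihood of model M_k, obtained by integrating its likelihood over the prior on its parameters θ_k, and p(M_k) is the prior probability assigned to the model.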
Scope
This topic covers the formulation of model averaging over a model space, posterior model probabilities as weights, the benefit of averaging for calibrated prediction under model uncertainty, the practical challenges of large model spaces, and predictive alternatives such as stacking.
Core questions
- How are predictions averaged across models using posterior model probabilities?
- Why does model averaging improve predictive calibration under model uncertainty?
- How are large or infinite model spaces handled in practice?
- How does stacking differ from posterior-probability weighting?
Key concepts
- posterior model probability
- model space
- model uncertainty
- predictive averaging
- stacking
- Occam's window
Key theories
- Averaging over the model space
- Treating the model index as an unknown quantity with its own posterior distribution yields predictions that integrate over models. If the true model is in the candidate set, this posterior-weighted average is the optimal predictive distribution in expected logarithmic score.
- Predictive stacking
- When no candidate is exactly correct, stacking chooses combination weights to maximize cross-validated predictive performance, often outperforming posterior-probability weighting in practice.
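As a minimal sketch of averaging over a small model space, the snippet below turns log marginal likelihoods into posterior model probabilities and combines per-model predictions. The function names, the uniform default prior, and the example numbers are all hypothetical and chosen only for illustration; the log marginal likelihoods are assumed to have been computed elsewhere.

```python
import math

def bma_weights(log_marginal_likelihoods, prior_probs=None):
    """Posterior model probabilities from log marginal likelihoods.

    Assumption: prior_probs are strictly positive; if omitted, a uniform
    prior over the candidate models is used.
    """
    k = len(log_marginal_likelihoods)
    if prior_probs is None:
        prior_probs = [1.0 / k] * k
    log_post = [lm + math.log(p)
                for lm, p in zip(log_marginal_likelihoods, prior_probs)]
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_post)
    unnormalized = [math.exp(v - top) for v in log_post]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def bma_mean_and_variance(weights, means, variances):
    """Mean and variance of the model-averaged predictive distribution.

    The variance is the weighted within-model variance plus the
    between-model spread of the predictive means.
    """
    mean = sum(w * m for w, m in zip(weights, means))
    within = sum(w * v for w, v in zip(weights, variances))
    between = sum(w * (m - mean) ** 2 for w, m in zip(weights, means))
    return mean, within + between

# Hypothetical inputs: three models with made-up log marginal likelihoods
# and made-up predictive means and variances for a single new observation.
weights = bma_weights([-100.0, -101.0, -105.0])
mean, variance = bma_mean_and_variance(weights, [1.0, 2.0, 4.0], [0.5, 0.5, 0.5])
```

The between-model term in the variance is what a single selected model leaves out, which is why the averaged prediction is typically less overconfident.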
Practical relevance
Model averaging produces better-calibrated predictive uncertainty in fields such as climate projection, epidemiological forecasting, and economics, where committing to a single model would understate the true uncertainty.
History
Bayesian model averaging was developed through the 1990s and synthesized in the 1999 tutorial by Hoeting and colleagues. Recognition that the true model is rarely in the candidate set later motivated predictive stacking as a more robust combination method.
Debates
- Model-probability weighting versus stacking
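- When all candidate models are wrong, posterior-probability weights tend to concentrate on a single model as data accumulate (the one closest to the truth in Kullback-Leibler divergence), even when a combination of models would predict better, so predictive stacking is increasingly preferred for combining models for prediction.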
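To make the stacking side of this debate concrete, the sketch below chooses combination weights that maximize the summed log score of the mixture on held-out points. It assumes a table of held-out (for example, leave-one-out) log predictive densities has already been computed for each model. The function name and the example table are hypothetical, and a simple EM-style fixed-point update is swapped in here in place of a general-purpose constrained optimizer.

```python
import math

def stacking_weights(log_pred_dens, n_iter=500):
    """Stacking weights on the simplex that maximize the held-out log score.

    log_pred_dens[i][k] is the held-out log predictive density of data
    point i under model k (assumed to be computed elsewhere).
    """
    n = len(log_pred_dens)
    k = len(log_pred_dens[0])
    # Rescale each row by its maximum before exponentiating. This only adds
    # a constant to the objective, so the maximizing weights are unchanged.
    dens = []
    for row in log_pred_dens:
        top = max(row)
        dens.append([math.exp(v - top) for v in row])
    w = [1.0 / k] * k  # start from equal weights
    for _ in range(n_iter):
        new_w = [0.0] * k
        for row in dens:
            mix = sum(wj * d for wj, d in zip(w, row))
            for j in range(k):
                # Share of point i's mixture density contributed by model j.
                new_w[j] += w[j] * row[j] / mix
        w = [v / n for v in new_w]
    return w

# Hypothetical table: model 0 predicts six points well, model 1 the other four.
example = [[-1.0, -3.0]] * 6 + [[-3.0, -1.0]] * 4
weights = stacking_weights(example)
```

In this made-up table neither model is best everywhere, so the maximizing weights keep both models in the mixture instead of putting all weight on the one with the higher total score.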
Key figures
- Adrian Raftery
- David Madigan
- Jennifer Hoeting
- Andrew Gelman
Related topics
Seminal works
- hoeting1999
- yao2018
Frequently asked questions
- Why not just pick the single best model?
- Selecting one model ignores the uncertainty about which model is correct and can produce overconfident predictions; averaging over models, or stacking them, propagates that uncertainty and usually improves predictive calibration.