Model Selection and Diagnostics

Model selection and diagnostics are the steps that decide which predictors a regression model should contain and whether the fitted model is trustworthy. Selection chooses and structures the predictors; diagnostics examine residuals, influential observations, and assumptions; and validation checks whether the model performs on data it was not built from. Together they guard against overfitting and misleading conclusions.

Definition

Model selection is the process of deciding which predictors and functional forms to include in a regression model. Model diagnostics are the procedures used to judge whether the fitted model meets its assumptions and performs adequately: residual analysis, influence measures, goodness-of-fit and calibration assessment, and validation.

Scope

This entry covers strategies for building a regression model (including stepwise and full-model approaches and the perils of data-driven selection), residual and influence diagnostics for checking assumptions, measures of fit and predictive performance such as discrimination and calibration, and internal and external validation. It applies across linear and logistic models and is a methodological topic, not clinical guidance.

Core questions

How are predictors chosen, and why is automated stepwise selection criticised?
How are residuals and influential observations used to check a model?
What is the difference between discrimination and calibration?
Why must a prediction model be validated rather than judged only on the data that built it?
How do overfitting and optimism distort apparent performance?

Key concepts

Variable (predictor) selection
Stepwise selection and its pitfalls
Residual analysis
Influential observations and leverage
Goodness of fit
Discrimination and calibration
Overfitting and optimism
Internal and external validation

Mechanisms

Building a regression model involves choosing which predictors enter, in what functional form, and whether interactions are needed. Automated stepwise procedures that add or drop predictors according to significance tests are widely criticised because they capitalise on chance associations, produce predictor sets that change from sample to sample, and yield optimistically biased coefficients and performance estimates. Diagnostics then examine the fitted model: residual plots reveal departures from linearity and non-constant variance, and influence measures identify observations that disproportionately drive the fit. Performance is judged by goodness of fit and, for prediction, by discrimination (how well the model separates individuals who do and do not have the outcome) and calibration (how closely predicted probabilities agree with observed frequencies). Because a model fitted and evaluated on the same data appears better than it truly is (optimism arising from overfitting), internal validation (for example, by resampling) and ideally external validation on new data are needed to estimate performance honestly.
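
The residual and influence diagnostics described above can be sketched for the simplest case, a linear model with one predictor. This is a minimal illustration and not a full diagnostic workflow: the function name simple_ols_diagnostics and the six data points are hypothetical, and the leverage formula used is the closed form that holds only for a single predictor with an intercept. Cook's distance is used here as one common influence measure.

```python
def simple_ols_diagnostics(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns the coefficients plus per-observation residuals,
    leverages, and Cook's distances.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Leverage is the diagonal of the hat matrix; with one predictor
    # it depends only on how far x_i lies from the mean of x.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    n_params = 2  # intercept and slope
    residual_variance = sum(e ** 2 for e in residuals) / (n - n_params)

    # Cook's distance combines residual size with leverage.
    cooks = [
        (e ** 2 / (n_params * residual_variance)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverage)
    ]
    return slope, intercept, residuals, leverage, cooks


# Hypothetical data: five points close to the line y = x,
# plus one distant point that lies well off that trend.
x = [1, 2, 3, 4, 5, 12]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]

slope, intercept, residuals, leverage, cooks = simple_ols_diagnostics(x, y)
most_influential = max(range(len(x)), key=lambda i: cooks[i])

print(f"slope={slope:.3f} intercept={intercept:.3f}")
for i, (e, h, d) in enumerate(zip(residuals, leverage, cooks)):
    print(f"obs {i}: residual={e:+.2f} leverage={h:.2f} Cook's D={d:.2f}")
print("most influential observation:", most_influential)
```

In this toy dataset the distant point has the highest leverage and the largest Cook's distance, yet not the largest residual, because it pulls the fitted line toward itself. That is one reason residual plots alone can miss an influential observation.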

Clinical relevance

Diagnostic and prognostic models inform much of clinical risk communication, and whether such a model has been properly selected, checked, and validated determines how much weight its predictions deserve. Appraising these steps is part of reading prediction-model studies. This entry describes the methods and is not a basis for individual diagnostic or treatment decisions.

Evidence & guidelines

The TRIPOD statement (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) provides reporting standards for studies that develop or validate multivariable prediction models, and the BMJ prognostic-research series sets out recommended practice for building, validating, and reporting such models. Harrell's text, Regression Modeling Strategies, details a full model-building and validation strategy that emphasises avoiding data-driven selection and quantifying optimism.

History

As regression became central to medical research, concern grew that data-driven predictor selection and unchecked fitting produced models that looked impressive in development but failed on new patients. From the 1990s onward, methodologists emphasised diagnostics, internal and external validation, and the distinction between discrimination and calibration; this culminated in consensus reporting guidance, notably the TRIPOD statement, for prediction-model studies.

Debates

Should predictors be chosen by automated stepwise selection?: Stepwise selection driven by significance tests is widely discouraged because it overfits, produces unstable predictor sets, and gives optimistically biased estimates; pre-specified models informed by subject knowledge, with shrinkage and proper validation, are generally preferred.
Why is external validation considered essential for prediction models?: A model evaluated only on its development data appears better than it is because of overfitting; performance on independent data is needed to judge whether predictions generalise, which is why reporting standards stress validation.
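
Both debates can be illustrated with a small simulation, offered here as a sketch under stated assumptions and not as a recommended analysis. The outcome and all candidate predictors are independent noise, so no predictor has any true association with the outcome. Picking the predictor with the strongest sample correlation stands in for a single step of significance-driven forward selection. The helper names (pearson_r, simulate_noise_dataset), the sample size, the number of predictors, and the seed are all hypothetical choices.

```python
import random


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def simulate_noise_dataset(rng, n_obs, n_predictors):
    """Outcome and predictors are independent Gaussian noise (no signal)."""
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    predictors = [
        [rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_predictors)
    ]
    return outcome, predictors


rng = random.Random(2024)  # fixed seed so the run is repeatable
n_obs, n_predictors = 50, 30

# Development data: keep the predictor with the strongest apparent association.
dev_outcome, dev_predictors = simulate_noise_dataset(rng, n_obs, n_predictors)
dev_r = [pearson_r(col, dev_outcome) for col in dev_predictors]
selected = max(range(n_predictors), key=lambda j: abs(dev_r[j]))

# Independent data from the same signal-free process: re-check that predictor.
val_outcome, val_predictors = simulate_noise_dataset(rng, n_obs, n_predictors)
val_r = pearson_r(val_predictors[selected], val_outcome)

print(f"selected predictor {selected}: apparent r = {dev_r[selected]:+.2f}")
print(f"same predictor on independent data: r = {val_r:+.2f}")

# Instability: repeat the selection on fresh development samples.
repeat_selections = []
for _ in range(5):
    outcome, predictors = simulate_noise_dataset(rng, n_obs, n_predictors)
    r_values = [abs(pearson_r(col, outcome)) for col in predictors]
    repeat_selections.append(max(range(n_predictors), key=lambda j: r_values[j]))
print("predictor selected in five new samples:", repeat_selections)
```

Because the selected predictor is by construction the best of many chance correlations, its apparent association overstates the truth, which here is zero. Its correlation on independent data is centred on zero, and the predictor chosen typically changes from one development sample to the next. With a real signal present the effect is less extreme, but the same mechanism inflates apparent performance.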

Key figures

Frank Harrell
Douglas Altman
Karel Moons
Patrick Royston
Gary Collins

Seminal works

harrell-2015
collins-2015-tripod

Frequently asked questions

What is the difference between discrimination and calibration?: Discrimination is how well a model separates individuals who do and do not have the outcome, while calibration is how closely the model's predicted probabilities match the observed frequencies. A model can discriminate well yet be poorly calibrated, so both should be assessed.
Why is stepwise variable selection discouraged?: Automated stepwise selection capitalises on chance associations, produces unstable predictor sets that vary across samples, and yields optimistically biased coefficients and performance, which is why pre-specified models with proper validation are generally preferred.
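
The distinction drawn in the first answer can be made concrete with a short sketch. It assumes a binary outcome coded 0/1 and uses two simple summaries: the c-statistic (equivalent to the area under the ROC curve) for discrimination, and calibration-in-the-large (observed event frequency minus mean predicted risk) for one aspect of calibration. The function names and the eight data points are hypothetical, and a fuller calibration assessment would also examine a calibration plot or slope.

```python
def c_statistic(outcomes, predicted):
    """Proportion of (event, non-event) pairs in which the event has the
    higher predicted risk; tied predictions count as half."""
    events = [p for y, p in zip(outcomes, predicted) if y == 1]
    non_events = [p for y, p in zip(outcomes, predicted) if y == 0]
    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def calibration_in_the_large(outcomes, predicted):
    """Observed event frequency minus mean predicted probability.
    Zero means predictions are correct on average."""
    observed = sum(outcomes) / len(outcomes)
    mean_predicted = sum(predicted) / len(predicted)
    return observed - mean_predicted


# Hypothetical outcomes (1 = event) and predicted risks from some model.
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
predicted = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]

# A second set of predictions with the same ranking but every risk halved.
predicted_halved = [p / 2 for p in predicted]

for label, probs in [("original", predicted), ("halved", predicted_halved)]:
    print(
        f"{label}: c-statistic = {c_statistic(outcomes, probs):.4f}, "
        f"observed - mean predicted = "
        f"{calibration_in_the_large(outcomes, probs):+.4f}"
    )
```

Halving every predicted risk leaves the ranking of individuals unchanged, so the c-statistic is identical for both sets of predictions. The halved predictions, however, underestimate the observed event frequency by a wide margin, so the model discriminates just as well but is poorly calibrated.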