Machine Learning and Predictive Analytics in Clinical Care

Machine learning and predictive analytics use patterns in clinical and health data to estimate the probability of outcomes, such as diagnoses, deterioration, readmission, or response to treatment, for individual patients. This topic covers how clinical prediction models are developed, validated, and reported, and the methodological standards that distinguish trustworthy models from misleading ones.

Definition

Clinical machine learning is the use of algorithms that learn statistical relationships from patient data to predict clinically relevant outcomes; a clinical prediction model combines multiple predictors to estimate the probability of a diagnosis (diagnostic) or a future event (prognostic) for an individual.

Scope

The entry covers supervised learning for diagnosis and prognosis, the data sources and features used in clinical settings, the central validation concepts of discrimination, calibration, and external validation, the risks of bias and overfitting, and reporting and appraisal standards such as TRIPOD and PROBAST. It frames clinical machine learning as a methodological topic, describing how predictive tools are built and judged rather than offering clinical recommendations.

Key concepts

Supervised learning (diagnosis and prognosis)
Discrimination, calibration, and clinical usefulness
Internal and external validation
Overfitting and optimism
Dataset shift and generalisability
Algorithmic bias and fairness
Reporting standards (TRIPOD) and risk-of-bias appraisal (PROBAST)
Deep learning and feature learning

Mechanisms

A clinical prediction model is fitted on labelled data, learning how predictors relate to an outcome, and is then assessed for discrimination (how well it separates those who do and do not experience the outcome) and calibration (how well predicted probabilities match observed frequencies). Because models tend to look better on the data that trained them than on new data, internal validation (for example, cross-validation or bootstrapping within the development data) and especially external validation on new populations are essential. Deployment can also be undermined by dataset shift, when the target setting differs from the development setting (Rajkomar, 2019). Deep learning extends these ideas by learning features directly from raw inputs such as images, signals, or text, which can improve performance on perceptual tasks while complicating interpretability (Esteva, 2019).
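
Discrimination and calibration are computed separately, which is why one can be good while the other is poor. The sketch below is illustrative only and is not taken from any cited study. It assumes invented toy data and hypothetical helper names (c_statistic, calibration_in_the_large), measures discrimination with the concordance statistic (equivalent to the area under the ROC curve for a binary outcome), and measures calibration crudely as mean predicted risk minus observed event rate.

```python
from itertools import product


def c_statistic(y_true, y_prob):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted risk. Tied pairs count as half. 0.5 is chance, 1.0 is perfect."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            score += 1.0
        elif p_event == p_non_event:
            score += 0.5
    return score / (len(events) * len(non_events))


def calibration_in_the_large(y_true, y_prob):
    """Mean predicted risk minus observed event rate. 0 is ideal; a positive
    value means risks are overestimated on average."""
    return sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)


# Hypothetical toy cohort: outcomes (1 = event) and predicted risks.
y = [0, 0, 0, 0, 1, 0, 1, 1]
p = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.70, 0.90]

# Same ranking of patients, but every risk is pushed upwards.
p_inflated = [0.5 + q / 2 for q in p]

for label, probs in [("original", p), ("inflated", p_inflated)]:
    print(
        f"{label}: c-statistic={c_statistic(y, probs):.3f}, "
        f"calibration-in-the-large={calibration_in_the_large(y, probs):+.3f}"
    )
```

Because the inflated risks are a strictly increasing transformation of the original ones, the ranking of patients and therefore the concordance statistic are unchanged, while the average predicted risk moves well above the observed event rate. A fuller calibration assessment would usually compare predicted and observed risk across the range of predictions rather than on average only.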

Clinical relevance

Predictive models increasingly feed risk scores, early-warning alerts, and triage tools embedded in clinical systems, so their accuracy, calibration, and fairness directly affect the quality of the guidance clinicians receive. This entry describes how such models are developed and evaluated; model outputs are probabilistic estimates requiring clinical interpretation and oversight, and the text is not a basis for any individual diagnostic or treatment decision.

Evidence & guidelines

Methodological consensus emphasises transparent development and rigorous validation. The TRIPOD statement sets reporting standards for prediction-model studies so that methods and performance can be appraised (Collins, 2015), and PROBAST provides a structured tool for judging risk of bias and applicability in such studies (Wolff, 2019). Reviews of machine learning in medicine stress external validation, calibration, attention to bias, and the gap between retrospective performance and prospective clinical benefit (Rajkomar, 2019; Esteva, 2019).

History

Clinical prediction has long roots in regression-based risk scores, but the 2010s saw rapid growth of machine learning and deep learning fed by electronic health records, imaging, and larger datasets. Alongside this came heightened concern about reproducibility, overstated performance, and bias, prompting reporting and appraisal frameworks (TRIPOD, PROBAST) intended to hold model studies to consistent methodological standards.

Debates

Why do many models perform worse in practice than in development studies?: Inadequate external validation, dataset shift between development and deployment settings, and optimistic reporting mean that strong retrospective performance often fails to translate into prospective clinical benefit, motivating stricter validation and reporting standards.
How should algorithmic bias and fairness be handled?: Models trained on historical data can encode and amplify disparities, raising debate over how to measure fairness, when performance differences across groups are acceptable, and how to monitor deployed models for bias over time.
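
One simple way to make the fairness question concrete is to compare a performance measure across subgroups. The sketch below is a minimal illustration under stated assumptions: the data and group labels are invented, the helper name sensitivity_by_group and the 0.5 threshold are hypothetical choices, and sensitivity is only one of several measures that could be compared. Which measure and which gap are acceptable is the subject of the debate itself.

```python
from collections import defaultdict


def sensitivity_by_group(y_true, y_prob, groups, threshold=0.5):
    """True-positive rate per subgroup at a fixed decision threshold.
    Groups without any events are omitted from the result."""
    flagged = defaultdict(int)  # events with predicted risk >= threshold
    events = defaultdict(int)   # all events in the group
    for outcome, risk, group in zip(y_true, y_prob, groups):
        if outcome == 1:
            events[group] += 1
            if risk >= threshold:
                flagged[group] += 1
    return {group: flagged[group] / events[group] for group in events}


# Hypothetical toy cohort with two subgroups, "A" and "B".
y_true = [1, 1, 1, 0, 1, 1, 1, 0]
y_prob = [0.8, 0.7, 0.6, 0.2, 0.7, 0.4, 0.3, 0.1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = sensitivity_by_group(y_true, y_prob, groups)
gap = max(rates.values()) - min(rates.values())

for group, rate in sorted(rates.items()):
    print(f"group {group}: sensitivity={rate:.2f}")
print(f"largest gap between groups: {gap:.2f}")
```

Repeating such a comparison on data collected after deployment is one way to monitor for differences that emerge over time.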

Key figures

Alvin Rajkomar
Gary S. Collins
Karel G. M. Moons
Isaac Kohane

Seminal works

Rajkomar (2019)
Collins (2015), the TRIPOD statement
Wolff (2019), PROBAST

Frequently asked questions

What is the difference between discrimination and calibration?: Discrimination is a model's ability to rank patients so that those who experience the outcome get higher predicted risks than those who do not, while calibration is the agreement between predicted probabilities and observed frequencies; a model can discriminate well yet be poorly calibrated, so both matter.
Why is external validation important for clinical prediction models?: Performance estimated on the data used to build a model is usually optimistic; testing on independent populations and settings reveals how well the model generalises and guards against deploying tools that fail when the case mix or documentation differs from the development data.
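
The optimism described in the last answer can be shown with a deliberately overfitted model. The sketch below is a toy simulation, not a validation study: the cohort generator, the one-nearest-neighbour "model", and all names are hypothetical, and the fresh data are drawn from the same simulated population, so the example shows optimism only. A real external validation would use data from a different population or setting.

```python
import random


def simulate_cohort(n, rng):
    """Hypothetical cohort: one predictor x in [0, 1); outcome risk equals x."""
    xs = [rng.random() for _ in range(n)]
    ys = [1 if rng.random() < x else 0 for x in xs]
    return xs, ys


def fit_nearest_neighbour(xs, ys):
    """Return a predictor that copies the outcome of the closest training
    patient. It memorises the development data, noise included."""
    training = list(zip(xs, ys))

    def predict(x):
        return min(training, key=lambda pair: abs(pair[0] - x))[1]

    return predict


def accuracy(predict, xs, ys):
    """Proportion of patients whose predicted outcome matches the observed one."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)


rng = random.Random(0)  # fixed seed so the run is repeatable
dev_x, dev_y = simulate_cohort(200, rng)  # development data
new_x, new_y = simulate_cohort(200, rng)  # fresh data, never seen in fitting

model = fit_nearest_neighbour(dev_x, dev_y)
apparent = accuracy(model, dev_x, dev_y)  # measured on the training data
fresh = accuracy(model, new_x, new_y)     # measured on new patients
optimism = apparent - fresh

print(f"apparent accuracy: {apparent:.2f}")
print(f"accuracy on fresh data: {fresh:.2f}")
print(f"optimism: {optimism:.2f}")
```

Each development patient is their own nearest neighbour, so the apparent accuracy is perfect by construction, whereas on fresh patients the memorised noise no longer helps and accuracy falls.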