Cross-Validation and Resampling

Cross-validation and resampling estimate a model's generalization error by repeatedly partitioning or resampling the available data, making efficient use of limited datasets.

Definition

Cross-validation estimates generalization error by partitioning the data into folds, training on all but one fold, testing on the held-out fold, and averaging the results as the held-out fold rotates. Resampling is the broader family: it includes the bootstrap and repeatedly draws samples from the data to estimate both the performance and the variability of a learning procedure.
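The rotation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helper names (k_fold_indices, cross_validate), the toy "predict the training mean" model, and the synthetic data are all assumptions made for the example.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, loss, k=5, seed=0):
    """Return the mean held-out loss over k folds, plus the per-fold losses."""
    fold_losses = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the k-1 training folds only.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Score on the single held-out fold.
        fold_losses.append(mean(loss(ys[i], predict(model, xs[i])) for i in held_out))
    return mean(fold_losses), fold_losses

# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return mean(ys)

def predict_mean(model, x):
    return model

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
cv_error, fold_errors = cross_validate(xs, ys, fit_mean, predict_mean, squared_error, k=5)
print(f"5-fold estimate of squared error: {cv_error:.2f}")
```

Because the folds are dealt from one shuffled list, each observation lands in exactly one held-out fold, which is what lets the average stand in for error on unseen data.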

Scope

This topic covers data-reuse methods for assessing models: the train-test split, k-fold and leave-one-out cross-validation, stratified cross-validation, nested cross-validation for hyperparameter tuning, and the bootstrap for estimating uncertainty. It addresses the bias and variance of these estimators and pitfalls such as data leakage that can invalidate them.

Core questions

How does k-fold cross-validation estimate generalization error?
What are the bias-variance trade-offs of different fold counts?
How does nested cross-validation keep tuning and assessment separate?
How does the bootstrap estimate the variability of an estimate?

Key theories

k-fold cross-validation: Splitting the data into k folds and rotating which fold is held out gives an estimate of generalization error in which every observation is used for testing exactly once and for training k - 1 times. The cost is k model fits instead of one, in exchange for an estimate that depends less on any single split.
Nested cross-validation: When hyperparameters are tuned, an inner cross-validation loop selects them and an outer loop assesses performance, preventing the optimistic bias that arises from tuning and evaluating on the same data.
The bootstrap: Resampling the data with replacement many times approximates the sampling distribution of a statistic or performance measure, providing confidence intervals and standard-error estimates without assuming a parametric form for the data distribution.
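As an illustration of the bootstrap entry above, the following sketch computes a percentile confidence interval for a model's accuracy. It assumes the percentile method (one of several bootstrap interval constructions); the function name bootstrap_ci and the 0/1 correctness data are hypothetical and exist only for the example.

```python
import random
from statistics import mean

def bootstrap_ci(data, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(data) at level 1 - alpha."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n items with replacement from the original data.
    replicates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Cut the same number of replicates from each tail.
    cut = int(n_resamples * alpha / 2)
    return replicates[cut], replicates[n_resamples - 1 - cut]

# Hypothetical per-example correctness of a classifier on 200 test cases (1 = correct).
correct = [1] * 170 + [0] * 30
low, high = bootstrap_ci(correct, mean)
print(f"accuracy = {mean(correct):.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval reflects only the variability that comes from the finite test sample; it does not account for variability from retraining the model on different data.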

Practical relevance

Cross-validation is the standard tool for estimating model performance and selecting models when data are limited, and the bootstrap is widely used to quantify uncertainty; misapplying them, for example by leaking test information into training or tuning on the evaluation data, is a frequent and serious cause of overstated results.
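One common form of the leakage mentioned above is fitting a preprocessing step on the full dataset before splitting. The sketch below contrasts that with fitting on the training portion only. The scaler helpers and the tiny dataset are assumptions for illustration; the same principle applies to any data-dependent step, such as feature selection or imputation.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn centering and scaling parameters from the given values."""
    return mean(values), pstdev(values) or 1.0

def transform(values, scaler):
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # an extreme held-out value

# Leaky: the scaler has seen the held-out value, so test information
# shapes how the training data are represented.
leaky_scaler = fit_scaler(train + test)

# Correct: the scaler is fit on the training portion alone and then
# applied unchanged to the held-out portion.
clean_scaler = fit_scaler(train)
train_scaled = transform(train, clean_scaler)
test_scaled = transform(test, clean_scaler)

print("leaky mean:", leaky_scaler[0], "clean mean:", clean_scaler[0])
```

Inside cross-validation, the "correct" pattern has to be repeated within every fold, so the preprocessing is refit each time the held-out fold changes.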

History

Cross-validation was developed as a formal method for estimating prediction error by Stone and Geisser in the 1970s. Efron introduced the bootstrap in 1979, and together these resampling methods became indispensable for evaluation and uncertainty estimation across statistics and machine learning.

Key figures

Mervyn Stone
Bradley Efron
Robert Tibshirani

Seminal works

hastie2009
efron1993
murphy2012

Frequently asked questions

What does k-fold cross-validation do?: It divides the data into k roughly equal parts (folds), then trains the model k times, each time holding out a different fold for testing and using the rest for training. Averaging the k test results gives an estimate of how the model will perform on unseen data.
Why is nested cross-validation sometimes needed?: If you tune hyperparameters and measure performance with the same cross-validation, the estimate is optimistic because the choices were fit to that data. Nested cross-validation uses an inner loop for tuning and an outer loop for assessment, keeping the two separate.
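The inner and outer loops can be sketched as follows, using only the standard library. Everything here is an assumption for illustration: the one-dimensional nearest-neighbour regressor, the hyperparameter grid, the synthetic data, and the helper names. With scikit-learn, the usual equivalent is to pass a GridSearchCV estimator to cross_val_score.

```python
import random
from statistics import mean

def make_folds(n, k, seed):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def split(data, held_out):
    held = set(held_out)
    train = [data[i] for i in range(len(data)) if i not in held]
    test = [data[i] for i in held_out]
    return train, test

def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return mean(y for _, y in nearest)

def mse(train, test, n_neighbors):
    return mean((y - knn_predict(train, x, n_neighbors)) ** 2 for x, y in test)

def cv_error(data, n_neighbors, k, seed):
    """Plain k-fold error for one hyperparameter value."""
    return mean(
        mse(*split(data, held), n_neighbors)
        for held in make_folds(len(data), k, seed)
    )

def nested_cv(data, grid, outer_k=5, inner_k=3, seed=0):
    outer_errors, chosen = [], []
    for held in make_folds(len(data), outer_k, seed):
        train, test = split(data, held)
        # Inner loop: tune using the outer training portion only.
        best = min(grid, key=lambda g: cv_error(train, g, inner_k, seed + 1))
        chosen.append(best)
        # Outer loop: score the tuned model on data the tuning never saw.
        outer_errors.append(mse(train, test, best))
    return mean(outer_errors), chosen

rng = random.Random(42)
data = [(i / 2, 0.5 * (i / 2) + rng.gauss(0, 1)) for i in range(60)]
estimate, chosen = nested_cv(data, grid=[1, 3, 5, 9])
print(f"nested CV error estimate: {estimate:.3f}, chosen per fold: {chosen}")
```

The selected hyperparameter may differ between outer folds. The outer average estimates the performance of the whole procedure (tuning plus fitting), not of one fixed hyperparameter value.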