Cross-Validation
Cross-validation estimates how well a model will predict new data by repeatedly fitting it on part of the sample and measuring its error on the held-out remainder.
Definition
Cross-validation is a resampling procedure that estimates the out-of-sample predictive error of a model by partitioning the data into complementary subsets, fitting on some subsets and evaluating prediction error on the others, and averaging over the partitions.
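The partition-fit-evaluate-average loop in this definition can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: it assumes a single predictor, a least-squares line as the model, and mean squared error as the loss, and the helper names (`fit_line`, `k_fold_indices`, `cv_mse`) are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_mse(xs, ys, k=5, seed=0):
    """Average over folds of the held-out mean squared error."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the complementary subset.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Evaluate only on the held-out subset.
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k
```

Averaging the per-fold errors weights every fold equally; when fold sizes differ, pooling the squared errors over all observations gives a slightly different number. Either convention is a reasonable choice for a sketch like this.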
Scope
This topic covers the validation-set approach, k-fold and leave-one-out cross-validation, and repeated cross-validation; their use for model selection and tuning-parameter choice; the bias-variance trade-off in the error estimate; and pitfalls such as information leakage and the optimism of in-sample error. It emphasizes the role of cross-validation in resampling-based assessment.
Core questions
- How does holding out data and predicting it estimate generalization error?
- What trade-offs distinguish leave-one-out from k-fold cross-validation?
- How is cross-validation used to select models and tune hyperparameters?
- What practices, such as avoiding information leakage, are needed for valid estimates?
Key concepts
- k-fold partitioning
- Leave-one-out cross-validation
- Validation set
- Generalization error
- Model selection
- Information leakage
Key theories
- Cross-validatory assessment
- Fitting on one part of the data and evaluating on a disjoint part gives an estimate of prediction error that, averaged over folds, approximates the model's error on independent future data.
- Bias-variance in the error estimate
- Leave-one-out cross-validation is nearly unbiased but can have high variance. k-fold with moderate k fits each model on fewer observations, so it accepts a small upward bias in the error estimate in exchange for lower variance; this motivates the common choice of five or ten folds.
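The arithmetic behind this trade-off can be made concrete. The sketch below assumes that k divides n evenly, and the function name `fold_summary` is hypothetical. For a given n and k it reports how many models are fitted, how much data each one sees, and what fraction of a training set is shared with any other training set.

```python
def fold_summary(n, k):
    """Bookkeeping for k-fold CV on n observations (assumes k divides n)."""
    fold_size = n // k
    train_size = n - fold_size
    # Two training sets each omit one fold, so they share n - 2*fold_size points.
    shared = n - 2 * fold_size
    return {
        "fits": k,
        "train_size": train_size,
        "train_fraction": train_size / n,
        "overlap_fraction": shared / train_size,
    }

for k in (5, 10, 100):
    print(k, fold_summary(100, k))
```

With n = 100, five folds train each model on 80 observations, while leave-one-out (k = 100) trains on 99. The smaller training sets are the source of the upward bias. The leave-one-out training sets are almost identical to one another, which is the usual informal explanation for why its fold-level errors are strongly correlated and its average can be more variable.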
Clinical relevance
Cross-validation is the standard tool for choosing among models, tuning regularization and other hyperparameters, and reporting honest predictive performance; it is central to statistical learning and machine-learning practice across the data-driven sciences.
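As an illustration of tuning a regularization parameter, the sketch below scores a grid of ridge penalties by k-fold cross-validation and keeps the one with the lowest held-out error. It is a toy under stated assumptions: one predictor, a ridge-shrunk slope, squared-error loss, and hypothetical helper names (`ridge_fit`, `cv_error`, `select_lambda`). The centering statistics are computed from the training fold only, so nothing about the held-out observations leaks into the fit.

```python
import random

def ridge_fit(xs, ys, lam):
    """Fit a ridge-shrunk line on the training data; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n  # training-fold means only
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / (sxx + lam)  # lam = 0 gives ordinary least squares
    return lambda x: my + b * (x - mx)

def cv_error(xs, ys, lam, k=5, seed=0):
    """Pooled held-out mean squared error for one penalty value."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    total = 0.0
    for f in range(k):
        test = idx[f::k]
        held = set(test)
        train = [i for i in idx if i not in held]
        predict = ridge_fit([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test)
    return total / len(xs)

def select_lambda(xs, ys, grid, k=5, seed=0):
    """Score every candidate on the same folds and return the best one."""
    scores = {lam: cv_error(xs, ys, lam, k, seed) for lam in grid}
    best = min(scores, key=scores.get)
    return best, scores
```

Reusing one seed means every candidate is scored on identical folds, so differences between scores reflect the penalty rather than the split. The winning score has itself been selected as a minimum and tends to be optimistic; reporting performance for the chosen model usually calls for a separate test set or an outer cross-validation loop.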
History
Cross-validatory ideas were formalized in the mid-1970s by Stone (1974) and Geisser (1975) as a principled way to assess and choose predictive models; the later growth of statistical and machine learning made k-fold cross-validation a routine default for model evaluation.
Debates
- Bias and variance of the cross-validation estimate
- There is continuing discussion of how many folds to use and how to obtain valid uncertainty estimates for cross-validated error: the folds themselves are disjoint, but the training sets overlap heavily, so the per-fold error estimates are correlated.
Key figures
- Mervyn Stone
- Seymour Geisser
- Trevor Hastie
- Robert Tibshirani
Related topics
Seminal works
- stone1974
- hastie2009
Frequently asked questions
- Why not just measure error on the data used to fit the model?
- In-sample error is optimistic because the model has been tuned to that very data, so it understates error on new data. Cross-validation evaluates predictions on data the model did not see during fitting, giving a more honest estimate.
- How many folds should I use?
- Five or ten folds are common choices that balance bias and variance and keep computation manageable. Leave-one-out uses as many folds as observations, which gives low bias but typically higher variance and requires one model fit per observation.
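The optimism of in-sample error is easiest to see with a model that memorizes its training data. The sketch below uses a one-nearest-neighbour predictor, chosen purely for illustration, on a small made-up data set; the function names are hypothetical. It compares the error on the fitting data with the leave-one-out error.

```python
def nn_predict(train_x, train_y, x):
    """Predict with the response of the closest training point."""
    j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[j]

def in_sample_mse(xs, ys):
    """Error measured on the same data the model was fitted to."""
    return sum((y - nn_predict(xs, ys, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def loo_mse(xs, ys):
    """Leave-one-out error: each point is predicted without itself."""
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        total += (ys[i] - nn_predict(train_x, train_y, xs[i])) ** 2
    return total / len(xs)

# Made-up data: the response alternates, so neighbours are poor predictors.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(in_sample_mse(xs, ys), loo_mse(xs, ys))
```

Each point is its own nearest neighbour, so the in-sample error is exactly zero however noisy the data are. Once the point being predicted is held out, the error is far from zero on this data set, which is the gap the answer above describes.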