ScholarGate

Cross-Validation

Cross-validation estimates how well a model will predict new data by repeatedly fitting it on part of the sample and measuring its error on the held-out remainder.

Definition

Cross-validation is a resampling procedure that estimates the out-of-sample predictive error of a model by partitioning the data into complementary subsets, fitting on some subsets and evaluating prediction error on the others, and averaging over the partitions.
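In symbols, the k-fold version of this definition can be written as below. The notation is an assumption introduced here for illustration, not taken from the source: F_1, ..., F_k are disjoint folds that partition the n observations, \hat{f}^{(-j)} is the model fitted with fold F_j removed, and L is the chosen loss (for example squared error).

```latex
\mathrm{CV}_{k} \;=\; \frac{1}{n} \sum_{j=1}^{k} \sum_{i \in F_j}
  L\!\left( y_i,\ \hat{f}^{(-j)}(x_i) \right)
```

Leave-one-out cross-validation is the special case k = n, where each fold holds a single observation.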

Scope

This topic covers leave-one-out and k-fold cross-validation, the validation-set and repeated cross-validation schemes, their use for model selection and tuning-parameter choice, the bias-variance trade-off in the error estimate, and pitfalls such as information leakage and the optimism of in-sample error. The emphasis is on cross-validation as a resampling-based method for assessing predictive models.

Core questions

  • How does holding out data and predicting it estimate generalization error?
  • What trade-offs distinguish leave-one-out from k-fold cross-validation?
  • How is cross-validation used to select models and tune hyperparameters?
  • What practices, such as avoiding information leakage, are needed for valid estimates?
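On the last question, one common leakage-avoiding practice is to estimate any preprocessing parameters from the training fold only and then apply them unchanged to the held-out fold. The sketch below illustrates this for standardization; the function name `standardize_within_fold` and the toy numbers are hypothetical, chosen only for this example.

```python
import statistics

def standardize_within_fold(train_x, test_x):
    """Scale both parts using the mean and spread of the TRAINING part only."""
    mu = statistics.fmean(train_x)
    sd = statistics.pstdev(train_x)
    scale = lambda v: (v - mu) / sd
    return [scale(v) for v in train_x], [scale(v) for v in test_x]

# Toy fold: the held-out value is extreme, but it plays no part in mu or sd.
train_x = [1.0, 2.0, 3.0, 4.0]
test_x = [100.0]
train_z, test_z = standardize_within_fold(train_x, test_x)
print(train_z, test_z)
```

Computing the mean and spread on the full sample before splitting would let the held-out value influence the transformation applied to the training data, which is the kind of leakage the question refers to.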

Key concepts

  • k-fold partitioning
  • Leave-one-out cross-validation
  • Validation set
  • Generalization error
  • Model selection
  • Information leakage

Key theories

Cross-validatory assessment
Fitting on one part of the data and evaluating on a disjoint part gives an estimate of prediction error that, averaged over folds, approximates the model's error on independent future data.
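A minimal sketch of this procedure, assuming squared-error loss and a one-feature least-squares line as the model. The helper names (`k_fold_indices`, `fit_line`, `cv_mse`) and the toy data are hypothetical and stand in for whatever model and data are actually being assessed.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k disjoint folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[j::k] for j in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cv_mse(xs, ys, k, seed=0):
    """k-fold estimate of mean squared prediction error."""
    sq_err = 0.0
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the observations outside the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only on the held-out fold.
        sq_err += sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    return sq_err / len(xs)

# Noise-free line, so the held-out error should be essentially zero.
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x for x in xs]
folds = k_fold_indices(len(xs), 5)
print(cv_mse(xs, ys, k=5))
```

Every observation is held out exactly once, so the returned value averages n out-of-fold squared errors.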
Bias-variance in the error estimate
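Leave-one-out cross-validation is nearly unbiased but can have high variance. k-fold cross-validation with moderate k fits each model on only a fraction (k-1)/k of the data, which biases the error estimate slightly upward but usually lowers its variance; this trade-off motivates the common choice of five or ten folds.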

Practical relevance

Cross-validation is the standard tool for choosing among models, tuning regularization and other hyperparameters, and reporting honest predictive performance; it is central to statistical learning and machine-learning practice across the data-driven sciences.
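As a hedged illustration of tuning a regularization parameter, the sketch below picks a ridge penalty from a small grid by 5-fold cross-validated squared error. The model (a single slope through the origin), the grid, the simulated data, and the names `ridge_slope` and `cv_error` are all assumptions made for this example.

```python
import random

def ridge_slope(xs, ys, lam):
    """Ridge estimate for y = b*x (no intercept): b = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5, seed=0):
    """k-fold mean squared error for one penalty value."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # same seed => same folds for every lam
    folds = [idx[j::k] for j in range(k)]
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        tr = [i for i in idx if i not in held]
        b = ridge_slope([xs[i] for i in tr], [ys[i] for i in tr], lam)
        total += sum((ys[i] - b * xs[i]) ** 2 for i in held_out)
    return total / len(xs)

# Simulated data (illustrative only): weak signal, unit-variance noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(xs, ys, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

Reusing the same folds for every candidate keeps the comparison between penalty values from being blurred by different random splits. The winning score is itself selected on these folds, so it is best read as a tuning criterion rather than a final performance report.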

History

Cross-validatory ideas were formalized by Stone and Geisser in the mid-1970s as a principled way to assess and choose predictive models. The later growth of statistical and machine learning made k-fold cross-validation a routine default for model evaluation.

Debates

Bias and variance of the cross-validation estimate
There is continuing discussion of how many folds to use and how to obtain valid uncertainty estimates for cross-validated error. The held-out folds are disjoint, but the training sets overlap heavily from one fold to the next, so the per-fold error estimates are correlated rather than independent.
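For reference, a commonly used naive standard error treats the k per-fold errors as if they were independent draws. The sketch below computes it; the function name `naive_cv_standard_error` and the fold errors are made up for illustration. Because the fits share most of their training data, this figure is only a rough guide and can understate the true uncertainty.

```python
import math
import statistics

def naive_cv_standard_error(fold_errors):
    """Mean of the per-fold errors and sd / sqrt(k), ignoring between-fold correlation."""
    k = len(fold_errors)
    mean = statistics.fmean(fold_errors)
    se = statistics.stdev(fold_errors) / math.sqrt(k)
    return mean, se

# Hypothetical per-fold mean squared errors from a 5-fold run.
mean, se = naive_cv_standard_error([1.0, 2.0, 3.0, 4.0, 5.0])
print(mean, se)
```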

Key figures

  • Mervyn Stone
  • Seymour Geisser
  • Trevor Hastie
  • Robert Tibshirani

Seminal works

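  • Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36(2), 111-147.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.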

Frequently asked questions

Why not just measure error on the data used to fit the model?
In-sample error is optimistic because the model has been tuned to that very data, so it understates error on new data. Cross-validation evaluates predictions on data the model did not see during fitting, giving a more honest estimate.
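An extreme case makes the optimism concrete. A one-nearest-neighbour predictor reproduces every training point exactly, so its in-sample error is zero no matter how noisy the data are, while its leave-one-out error is not. The sketch below uses simulated data; the helper name `nearest_neighbor_predict` and the data-generating line are assumptions for illustration.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Predict with the response of the closest training point (1-NN)."""
    j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[j]

# Simulated noisy data on distinct x values.
rng = random.Random(0)
xs = [float(i) for i in range(30)]
ys = [0.1 * x + rng.gauss(0, 1) for x in xs]

# In-sample: each point is its own nearest neighbour, so every residual is 0.
in_sample = sum(
    (y - nearest_neighbor_predict(xs, ys, x)) ** 2 for x, y in zip(xs, ys)
) / len(xs)

# Leave-one-out: the point being predicted is removed before predicting it.
loo = 0.0
for i in range(len(xs)):
    tx, ty = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
    loo += (ys[i] - nearest_neighbor_predict(tx, ty, xs[i])) ** 2
loo /= len(xs)

print(in_sample, loo)
```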
How many folds should I use?
Five or ten folds are common choices that balance bias and variance and keep computation manageable. Leave-one-out uses as many folds as observations, giving low bias but higher variance and, because the model is refit once per observation, greater cost.
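The cost of leave-one-out is not always prohibitive. For ordinary least squares, a standard identity gives each leave-one-out residual as the full-fit residual divided by (1 - h_i), where h_i is the observation's leverage, so a single fit suffices. The sketch below checks that identity against brute-force refitting for a one-feature line; the function names and simulated data are hypothetical, and the shortcut should not be assumed to carry over to other model classes.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def loocv_brute_force(xs, ys):
    """Refit n times, each time leaving one observation out."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / len(xs)

def loocv_shortcut(xs, ys):
    """One fit; rescale each residual by 1 / (1 - leverage)."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = 1.0 / n + (x - mx) ** 2 / sxx   # leverage in simple linear regression
        total += ((y - (a + b * x)) / (1.0 - h)) ** 2
    return total / n

rng = random.Random(2)
xs = [float(i) for i in range(15)]
ys = [3.0 - 0.5 * x + rng.gauss(0, 1) for x in xs]
brute, quick = loocv_brute_force(xs, ys), loocv_shortcut(xs, ys)
print(brute, quick)
```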