Why not measure performance on the training data?

A model can fit its training data closely, including its noise, so training error underestimates the error on new data. Honest evaluation requires data the model has never seen, obtained through a held-out test set or cross-validation.

What is the difference between a validation set and a test set?

A validation set is used during development to tune hyperparameters and select models, while the test set is reserved for a single final assessment. Keeping them separate prevents the choices made during tuning from inflating the reported performance.

Model Evaluation and Selection

Model evaluation and selection are the methods for estimating how well a model will generalize and for choosing among competing models and settings.

Leia teema tööriistaga PaperMindPeagiFind papers & topics

Tools & resources

Laadi slaidid alla

Learn & explore

VideoPeagi

Definition

Model evaluation is the estimation of a model's expected performance on unseen data, and model selection is the use of such estimates to choose among models, features, or hyperparameter settings; both rely on separating data used for fitting from data used for assessment to obtain honest estimates of generalization.

Scope

This area covers the empirical methodology of machine learning: estimating generalization error by holding out data and by cross-validation, performance metrics for classification and regression, the search for good hyperparameters, and the control of model complexity through regularization. It addresses how to avoid optimistic bias from evaluating on training data and how to compare models fairly.

Sub-topics

Core questions

How can generalization error be estimated without overoptimism?
Which metrics correctly capture performance for a given task?
How are hyperparameters chosen without contaminating the evaluation?
How is model complexity tuned to the available data?

Key theories

Honest error estimation: Estimating performance on data not used for fitting, through held-out test sets or cross-validation, is essential because error measured on training data is optimistically biased.
Model selection and complexity control: Choosing among models requires balancing fit against complexity, using validation estimates or information criteria to select the model expected to generalize best.
Separation of selection and assessment: Hyperparameters must be tuned on validation data kept separate from the final test set, since reusing test data for selection produces overly optimistic performance estimates.

Clinical relevance

Sound evaluation methodology is what makes machine-learning results trustworthy; failures such as testing on training data, tuning on the test set, or choosing misleading metrics are common causes of models that look excellent in development but fail in deployment, making this area essential to responsible practice.

History

Cross-validation was formalized by Stone and others in the 1970s as a way to estimate prediction error, and information criteria such as Akaike's and the Bayesian criterion gave model-selection rules grounded in likelihood. As machine learning matured, rigorous train, validation, and test protocols and a wide range of performance metrics became standard practice.

Debates

Choosing the right metric: A single accuracy figure can mislead on imbalanced or cost-sensitive problems, prompting debate over which metrics best reflect real-world objectives and how to report performance honestly.

Key figures

Trevor Hastie
Robert Tibshirani
Mervyn Stone

Seminal works

hastie2009
bishop2006
murphy2012

Frequently asked questions

Why not measure performance on the training data?: A model can fit its training data closely, including its noise, so training error underestimates the error on new data. Honest evaluation requires data the model has never seen, obtained through a held-out test set or cross-validation.
What is the difference between a validation set and a test set?: A validation set is used during development to tune hyperparameters and select models, while the test set is reserved for a single final assessment. Keeping them separate prevents the choices made during tuning from inflating the reported performance.