Training, Validation, and Test Sets

Splitting data for honest evaluation

To estimate how a model performs on unseen data, the dataset is divided into three parts: a training set to fit the model, a validation set to tune hyperparameters and select among competing models, and a held-out test set used only once for final evaluation. Cross-validation rotates the training and validation roles across subsets of the data to use it more efficiently. Data leakage, in which information from the validation or test data influences training, silently inflates performance estimates and must be avoided.

The Concept and Its Logic

In machine learning and statistical modelling, the goal is a model that works well on new, unseen data. If the same data are used for both learning and evaluation, performance estimates become unrealistically optimistic, because a model that has overfitted (memorised noise and quirks of its training data) is scored on the very examples it memorised. To address this, data are divided into three parts: the training set is used to learn the model parameters, the validation set is used to tune hyperparameters such as regularisation strength or tree depth, and the test set is used only once, at the final evaluation stage. This separation provides an honest window onto real-world performance.
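The three-part division can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a recommended implementation: the function name three_way_split, the 20 percent validation and test fractions, and the seed value are all assumptions chosen for the example.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut into train / validation / test lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must lie strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                # set aside; used once at the very end
    val = shuffled[n_test:n_test + n_val]   # used for tuning and model selection
    train = shuffled[n_test + n_val:]       # used to fit model parameters
    return train, val, test


# Hypothetical dataset of 100 row identifiers.
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the rows arrive in a meaningful order (for example sorted by class). For time-ordered data a random shuffle would itself leak future information, so a chronological cut would be the safer assumption there.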

Splitting and Cross-Validation

A common practice is to split data roughly 60 percent for training, 20 percent for validation, and 20 percent for testing, though the ratios depend on dataset size: very large datasets can afford smaller validation and test fractions. When data are scarce, k-fold cross-validation is more efficient: the data are divided into k folds of roughly equal size; in each iteration one fold is held out as the validation set and the remaining k-1 folds are used for training. This repeats k times so that every fold serves as the validation set exactly once, and the k error estimates are averaged. Stratified k-fold preserves class proportions in each fold, giving more reliable results on imbalanced datasets. The test set must be kept entirely outside the cross-validation loop.
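The following is a rough sketch of stratified k-fold splitting, assuming the test set has already been removed and that labels holds only the remaining data. The helper names stratified_k_fold and majority_error, the toy labels, and the majority-class baseline are all made up for illustration; established libraries provide tested equivalents (scikit-learn's StratifiedKFold, for instance).

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose folds keep class proportions roughly equal."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, val_idx


# Imbalanced toy labels (made-up data): 40 negatives and 10 positives.
labels = [0] * 40 + [1] * 10
splits = list(stratified_k_fold(labels, k=5))


def majority_error(train_idx, val_idx):
    """Error of a baseline that always predicts the training fold's majority class."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] != majority for i in val_idx) / len(val_idx)


# One error per fold, then the average across the k folds.
fold_errors = [majority_error(tr, va) for tr, va in splits]
mean_error = sum(fold_errors) / len(fold_errors)
print(fold_errors, mean_error)
```

The baseline stands in for whatever model is being evaluated; the point is only the loop structure, in which each fold is validated exactly once and the per-fold errors are averaged.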

Common Misuses and Misconceptions

One of the most common and damaging errors is data leakage. When preprocessing steps such as standardisation, feature selection, or resampling are applied to the entire dataset before splitting, information from the validation and test sets bleeds into training, creating an illusion of success that does not hold up in practice. The correct approach is to fit all transformations on the training set only and then apply the fitted transformations unchanged to the validation and test sets; resampling, in particular, should be applied to the training data alone. Another frequent mistake is evaluating on the test set repeatedly and revising the model in response to the results; this silently turns the test set into a second validation set and makes the final estimate overly optimistic.
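The fit-on-training-only rule can be illustrated with standardisation of a single feature. This is a minimal sketch under stated assumptions: the helper names fit_standardiser and apply_standardiser and the toy values are hypothetical, and a real project would more likely wrap the same idea in a library pipeline.

```python
import statistics


def fit_standardiser(train_values):
    """Learn the mean and standard deviation from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        raise ValueError("cannot standardise a constant feature")
    return mean, std


def apply_standardiser(values, params):
    """Transform any split using parameters that were learned elsewhere."""
    mean, std = params
    return [(v - mean) / std for v in values]


# Made-up feature values for illustration.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
test_x = [10.0, 12.0]

# Correct: parameters come from the training set and are reused on the test set.
params = fit_standardiser(train_x)
train_z = apply_standardiser(train_x, params)
test_z = apply_standardiser(test_x, params)

# Leaky variant, shown only for contrast: the test values shift the learned
# mean and spread, so information about the test set reaches the training data.
leaky_params = fit_standardiser(train_x + test_x)
print(params, leaky_params)
```

Inside cross-validation the same rule applies per fold: the standardiser would be refitted on each iteration's training folds and then applied to that iteration's validation fold.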

Why It Matters and How to Report It

Correct splitting yields a trustworthy estimate of a model's generalisation ability and prevents researchers from publishing misleadingly optimistic findings driven by overfitting or leakage. When reporting, the splitting strategy, proportions, and random seed must be stated explicitly. If cross-validation was used, the number of folds and whether stratification was applied should be disclosed. Test-set performance should be computed and reported only once, after all hyperparameter and model selection is finished. Presenting both the mean and the standard deviation of performance estimates across folds or repeated runs strengthens the credibility of findings. Transparent reporting directly affects the reproducibility and scientific integrity of a study.
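One way to keep these reporting details together is to record them alongside the summary statistics. The sketch below is only an illustration: the function name summarise_cv and its fields are assumptions, and the fold scores are placeholders rather than results from any real experiment.

```python
import statistics


def summarise_cv(fold_scores, n_folds, stratified, seed):
    """Bundle the cross-validation settings with the mean and sample standard deviation."""
    if len(fold_scores) != n_folds:
        raise ValueError("expected exactly one score per fold")
    return {
        "n_folds": n_folds,
        "stratified": stratified,
        "seed": seed,
        "mean": statistics.fmean(fold_scores),
        "std": statistics.stdev(fold_scores),  # sample standard deviation across folds
    }


# Placeholder accuracy scores, invented purely to exercise the function.
report = summarise_cv([0.80, 0.82, 0.78, 0.81, 0.79], n_folds=5, stratified=True, seed=0)
print(
    f"accuracy {report['mean']:.3f} +/- {report['std']:.3f} "
    f"({report['n_folds']}-fold, stratified={report['stratified']}, seed={report['seed']})"
)
```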

Sources

  1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. ISBN: 978-0-387-84857-0