Regularization and Model Complexity
Regularization controls model complexity by penalizing or constraining a model, reducing overfitting and improving generalization.
Definition
Regularization is any modification to a learning procedure that reduces its tendency to overfit, typically by adding a penalty on model complexity to the loss or by constraining the model, so that the fitted model generalizes better even at the cost of slightly worse fit to the training data.
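As a sketch of the penalty form of this definition, the regularized objective can be written as below. The symbols are assumptions introduced here for illustration and do not come from the original text: \ell is a per-example loss, f_\theta the model, \Omega a complexity penalty, and \lambda the strength of the penalty.

```latex
\hat{\theta}
  = \arg\min_{\theta} \;
    \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big)
    \;+\; \lambda \, \Omega(\theta),
\qquad \lambda \ge 0
```

Setting \lambda = 0 recovers the unregularized fit. Larger values trade training fit for a smaller penalty, with \Omega(\theta) = \lVert\theta\rVert_2^2 giving an L2 penalty and \Omega(\theta) = \lVert\theta\rVert_1 an L1 penalty.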
Scope
This topic covers techniques for controlling complexity: L2 and L1 penalties on parameters, early stopping, dropout and data augmentation in neural networks, and information criteria that penalize complexity in model selection. It frames regularization as encoding a preference for simpler models and connects it to the Bayesian view of priors over parameters.
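For the information criteria mentioned above, the two most common examples are AIC and BIC. In the standard definitions below, k is the number of fitted parameters, n the number of observations, and \hat{L} the maximized likelihood (notation assumed here, not taken from the original text).

```latex
\mathrm{AIC} = 2k - 2 \ln \hat{L},
\qquad
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
```

Lower values are preferred. In both, the first term grows with the number of parameters, so an added parameter must improve the likelihood enough to pay for itself.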
Core questions
- How do complexity penalties reduce overfitting?
- How do L1 and L2 penalties differ in their effect?
- What regularization methods are specific to neural networks?
- How does regularization relate to the Bayesian use of priors?
Key theories
- Penalized loss
- Adding a penalty on parameter magnitude to the training loss discourages overly complex solutions, with L2 shrinking coefficients smoothly and L1 promoting sparsity by setting some to zero.
- Regularization in deep learning
- Techniques such as early stopping, dropout, weight decay, and data augmentation control the effective complexity of neural networks, which would otherwise overfit given their large capacity.
- Bayesian interpretation
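- A complexity penalty corresponds to the negative log of a prior over parameters, so regularized estimation can be read as maximum a posteriori (MAP) estimation: finding the most probable parameters given the data under that prior. An L2 penalty matches a Gaussian prior and an L1 penalty a Laplace prior, linking regularization to Bayesian inference.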
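The Bayesian interpretation can be made concrete with a short derivation. This is a sketch under assumed notation: D is the training data, \tau^2 the variance of a zero-mean Gaussian prior, and b the scale of a zero-mean Laplace prior applied independently to each parameter.

```latex
\begin{aligned}
\hat{\theta}_{\mathrm{MAP}}
  &= \arg\max_{\theta} \; p(\theta \mid D)
   = \arg\max_{\theta} \; p(D \mid \theta)\, p(\theta) \\
  &= \arg\min_{\theta} \; \big[ -\log p(D \mid \theta) - \log p(\theta) \big] \\[4pt]
\theta \sim \mathcal{N}(0, \tau^2 I)
  &\;\Rightarrow\; -\log p(\theta)
   = \frac{1}{2\tau^2} \lVert \theta \rVert_2^2 + \text{const} \\
\theta_j \sim \mathrm{Laplace}(0, b)
  &\;\Rightarrow\; -\log p(\theta)
   = \frac{1}{b} \lVert \theta \rVert_1 + \text{const}
\end{aligned}
```

The negative log-likelihood plays the role of the training loss and the negative log-prior plays the role of the penalty. A tighter prior (smaller \tau or b) corresponds to a stronger penalty.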
Practical relevance
Regularization is one of the most important practical tools for making models generalize. It is essential when models have high capacity relative to the data, as in modern deep networks. Choosing the right amount and form of regularization is itself a tuning problem, and it is central to building reliable models.
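One common way to treat the amount of regularization as a tuning problem is to sweep the penalty strength and compare error on held-out data. The sketch below assumes synthetic data, a closed-form ridge fit, and an arbitrary grid of strengths; the helper names (ridge_fit, mse) and all constants are illustrative choices, not part of the original text.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/2n)||y - Xw||^2 + (lam/2)||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def mse(X, y, w):
    """Mean squared prediction error of weights w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

# Synthetic setup (assumed): few training points relative to parameters.
rng = np.random.default_rng(0)
n_train, n_val, d = 40, 200, 30
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_val = rng.normal(size=(n_val, d))
y_train = X_train @ w_true + rng.normal(size=n_train)
y_val = X_val @ w_true + rng.normal(size=n_val)

# Sweep the penalty strength and record training and validation error.
lambdas = [0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
results = []
for lam in lambdas:
    w = ridge_fit(X_train, y_train, lam)
    results.append((lam, mse(X_train, y_train, w), mse(X_val, y_val, w)))

# Pick the strength with the lowest validation error.
best_lam, _, best_val = min(results, key=lambda r: r[2])

for lam, train_err, val_err in results:
    print(f"lambda={lam:<6} train MSE={train_err:.3f}  val MSE={val_err:.3f}")
print("selected lambda:", best_lam)
```

Training error can only rise as the penalty grows, so it cannot be used to choose the strength; the validation column is what guides the choice. In practice the held-out split is often replaced by cross-validation.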
History
Penalized estimation goes back to Tikhonov regularization for ill-posed problems and to ridge regression in statistics, with the lasso later adding sparsity. In deep learning, dropout (introduced around 2012), weight decay, and data augmentation became standard means of controlling the large capacity of neural networks.
Key figures
- Andrey Tikhonov
- Robert Tibshirani
- Geoffrey Hinton
Related topics
Seminal works
- hastie2009
- goodfellow2016
- tibshirani1996
Frequently asked questions
- What does regularization do?
- It discourages a model from becoming too complex, usually by adding a penalty on the size of its parameters or by constraining training. This reduces overfitting, so the model captures the underlying pattern rather than the noise and performs better on new data.
- Why does L1 regularization produce sparse models?
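- The L1 penalty, the sum of the absolute values of the parameters, has a corner at zero: its slope stays constant however small a coefficient gets, so the penalty can outweigh a weak feature's contribution to the fit and hold that coefficient at exactly zero. The slope of the L2 penalty vanishes near zero, so it only shrinks coefficients. Zeroing a coefficient effectively removes the corresponding feature, yielding a simpler, more interpretable model.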
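The contrast between shrinking and zeroing can be seen in a small numerical sketch. It assumes synthetic linear-regression data in which only the first three of ten features matter, a closed-form ridge fit for the L2 case, and plain coordinate descent with soft-thresholding for the L1 case. The function names, penalty strength, and data are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/2n)||y - Xw||^2 + (lam/2)||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_fit(X, y, lam, n_iters=200):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n
            # One-dimensional L1 problem has a soft-threshold solution.
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Synthetic data (assumed): only the first 3 of 10 features are relevant.
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.5 * rng.normal(size=n)

w_ridge = ridge_fit(X, y, lam=0.5)
w_lasso = lasso_fit(X, y, lam=0.5)

print("ridge coefficients:", np.round(w_ridge, 3))
print("lasso coefficients:", np.round(w_lasso, 3))
print("exact zeros (ridge):", int(np.sum(w_ridge == 0.0)))
print("exact zeros (lasso):", int(np.sum(w_lasso == 0.0)))
```

The soft-threshold step is where the corner of the L1 penalty shows up: any coordinate whose correlation with the residual falls below the penalty strength is set to exactly zero. The ridge solution has no such step, so its coefficients become small but stay nonzero.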