Regularization
Preventing overfitting with a penalty
Regularization adds a penalty on model complexity to the fitting objective, shrinking coefficients toward zero to reduce variance and overfitting. L2 (ridge) shrinks all coefficients smoothly; L1 (lasso) can set some exactly to zero, performing variable selection; elastic net blends both. A tuning parameter controls the strength of the penalty, trading some added bias for what is often a larger reduction in variance.
Concept and Logic
Ordinary least squares (OLS) regression minimizes the residual sum of squares (RSS). Regularization augments this objective with a penalty term: for ridge, RSS + lambda * sum(beta_j^2); for lasso, RSS + lambda * sum(|beta_j|). The sums run over the slope coefficients; the intercept is conventionally left unpenalized. Here lambda >= 0 is the tuning parameter that controls penalty strength. When lambda = 0 we recover OLS; as lambda grows, coefficients are pushed toward zero. This discourages the model from fitting noise in the training data and favors patterns that generalize. Elastic net includes both penalty terms in one objective, so it shrinks like ridge while still being able to set coefficients to zero like lasso.
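For a single predictor with no intercept, the penalized objectives above have closed-form minimizers, which makes the effect of lambda easy to see. The sketch below is illustrative only: the function names and the toy data are made up, and the elastic net is written with two separate strengths (lam1 for the L1 term, lam2 for the L2 term), which is one possible parameterization and not one given in the text.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_1d(x, y, lam):
    """Minimize sum((y_i - b*x_i)^2) + lam * b^2 over b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def lasso_1d(x, y, lam):
    """Minimize sum((y_i - b*x_i)^2) + lam * |b| over b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam / 2) / sxx


def elastic_net_1d(x, y, lam1, lam2):
    """Minimize sum((y_i - b*x_i)^2) + lam1 * |b| + lam2 * b^2 over b.

    Assumed parameterization: separate strengths for the two penalties.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, lam1 / 2) / (sxx + lam2)


# Toy data (made up): y = 2 * x exactly, so the OLS slope is 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in [0.0, 14.0, 56.0, 100.0]:
    print(lam, ridge_1d(x, y, lam), lasso_1d(x, y, lam))
```

With lambda = 0 both estimators return the OLS slope of 2. As lambda grows, the ridge slope keeps shrinking but stays positive, while the lasso slope reaches exactly zero at lambda = 56 on this data and stays there.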
Computation and Interpretation
The tuning parameter lambda is usually chosen by k-fold cross-validation (commonly 5- or 10-fold): each candidate value on a grid is evaluated, and the one minimizing the average validation error is selected. Ridge pushes all coefficients toward zero but, for any finite lambda, generally not exactly to zero; lasso sets some coefficients exactly to zero, effectively removing those predictors. Lasso output therefore provides variable selection directly: non-zero coefficients indicate retained predictors. Coefficient magnitudes must be interpreted relative to lambda: a small coefficient under heavy penalization does not necessarily mean the predictor is substantively unimportant; it may simply have been shrunk by the penalty.
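The selection procedure can be sketched in a few lines. This is a minimal illustration and not a recommended implementation: it uses a one-predictor ridge fit without intercept, simulated data, an arbitrary candidate grid, and folds assigned by index modulo k (real analyses would normally shuffle first). All names here are hypothetical.

```python
import random


def ridge_1d(x, y, lam):
    """One-predictor ridge slope: minimizes sum((y - b*x)^2) + lam * b^2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def cv_error(x, y, lam, k):
    """Mean squared validation error over k folds for one lambda."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        # Observation i belongs to validation fold i % k.
        train = [i for i in range(n) if i % k != fold]
        valid = [i for i in range(n) if i % k == fold]
        b = ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - b * x[i]) ** 2 for i in valid)
    # Every observation is validated exactly once, so divide by n.
    return total / n


def select_lambda(x, y, lambdas, k=10):
    """Return the candidate with the lowest CV error, plus all errors."""
    errors = {lam: cv_error(x, y, lam, k) for lam in lambdas}
    best = min(errors, key=errors.get)
    return best, errors


# Simulated data (assumption): true slope 0.5 with unit-variance noise.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(30)]
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
best, errors = select_lambda(x, y, lambdas, k=10)
print("selected lambda:", best)
for lam in lambdas:
    print(lam, round(errors[lam], 4))
```

The same loop applies to lasso or elastic net by swapping the fitting function; only the model fitted inside each fold changes.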
Common Misconceptions
First misconception: a larger lambda always improves the model. It does not; an excessively large lambda drives all coefficients toward zero, so the model underfits (high bias). Second misconception: ridge and lasso yield similar results. They can differ substantially; lasso produces sparse solutions, while ridge generally keeps every coefficient non-zero. Third misconception: regularization is only necessary for large datasets. On the contrary, it is most critical in high-dimensional problems with few observations relative to the number of predictors. Fourth misconception: regularization introduces no bias. It does; shrinking coefficients toward zero is a form of bias, accepted in exchange for a reduction in variance.
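The first two points can be checked on a small example. The sketch below fits both penalties by cyclic coordinate descent, a standard approach for these objectives, though not one prescribed by the text. The two-predictor design is made up and deliberately orthogonal so the results are easy to verify by hand, and the function names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def coordinate_descent(X, y, lam, penalty, sweeps=100):
    """Minimize RSS + lam * penalty(beta) by cyclic coordinate descent.

    X is a list of predictor columns (no intercept);
    penalty is "ridge" (sum of squares) or "lasso" (sum of absolutes).
    """
    p, n = len(X), len(y)
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Partial residual: remove the fit of all predictors except j.
            resid = [
                y[i] - sum(X[k][i] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[j][i] * resid[i] for i in range(n))
            sxx = sum(v * v for v in X[j])
            if penalty == "ridge":
                beta[j] = rho / (sxx + lam)
            elif penalty == "lasso":
                beta[j] = soft_threshold(rho, lam / 2) / sxx
            else:
                raise ValueError("penalty must be 'ridge' or 'lasso'")
    return beta


# Made-up orthogonal design: y = 3.0 * x1 + 0.5 * x2 exactly.
X = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
y = [3.5, 2.5, 3.5, 2.5]

ridge = coordinate_descent(X, y, 8.0, "ridge")
lasso = coordinate_descent(X, y, 8.0, "lasso")
lasso_huge = coordinate_descent(X, y, 100.0, "lasso")
print("ridge:", ridge)
print("lasso:", lasso)
print("lasso, very large lambda:", lasso_huge)
```

At lambda = 8 the lasso fit is [2.0, 0.0], dropping the weaker predictor entirely, while ridge keeps both coefficients non-zero (1.0 and about 0.167). At lambda = 100 the lasso sets both coefficients to zero, so the model predicts zero for every observation.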
Why It Matters and How to Report
Regularization is one of the most common and practical tools for controlling overfitting; it is especially valuable for high-dimensional data such as genomic, text, and econometric data. When reporting, state which method was used (ridge, lasso, elastic net) and justify the choice; describe the lambda selection procedure (number of folds, criterion used); report the final lambda value and the corresponding test or cross-validation error. If lasso was used, report how many predictors were retained and which were zeroed out. When reporting coefficients, clarify whether they are standardized or unstandardized, since the penalty depends on the scale of each predictor and the fit is therefore sensitive to variable scaling.
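The sensitivity to scaling can be shown with a small sketch. It assumes a hypothetical predictor recorded once in kilometres and once in metres, a one-predictor ridge fit without intercept, and made-up response values; none of these come from the text.

```python
def ridge_1d(x, y, lam):
    """One-predictor ridge slope: minimizes sum((y - b*x)^2) + lam * b^2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def standardize(v):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    n = len(v)
    mean = sum(v) / n
    sd = (sum((vi - mean) ** 2 for vi in v) / n) ** 0.5
    return [(vi - mean) / sd for vi in v]


# Hypothetical predictor in kilometres, and the same values in metres.
x_km = [1.0, 2.0, 3.0, 4.0]
x_m = [1000.0 * v for v in x_km]
y = [1.1, 1.9, 3.2, 3.8]
lam = 10.0

# Raw units: the same lambda shrinks the two fits by different amounts,
# so the fitted values differ even though the data are the same.
fit_km = [ridge_1d(x_km, y, lam) * v for v in x_km]
fit_m = [ridge_1d(x_m, y, lam) * v for v in x_m]
print("fitted value at 4 km:", fit_km[3], "vs", fit_m[3])

# Standardized predictor and centered response: units no longer matter.
y_mean = sum(y) / len(y)
y_c = [yi - y_mean for yi in y]
b_km = ridge_1d(standardize(x_km), y_c, lam)
b_m = ridge_1d(standardize(x_m), y_c, lam)
print("standardized slopes:", b_km, b_m)
```

On raw units the kilometre version is shrunk far more than the metre version, because the penalty acts on the size of the coefficient and that size depends on the unit. After standardization the two slopes agree up to rounding, which is why a report should say which scale the coefficients are on.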
Sources
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. ISBN: 978-0-387-84857-0