Regularisation: L1 vs L2

Motivation: why this matters in quant finance

Regularisation is how a model pays rent for complexity. In finance, where signals are weak and features are correlated, an unpenalised model can fit historical noise with impressive in-sample statistics. L1 and L2 penalties are the two basic controls.

This is a comparison note, not a third derivation of Ridge Regression or Lasso Regression. The Ridge Regression note covers dense L2 shrinkage, and the Lasso Regression note covers sparse L1 selection. Here the goal is to decide which geometry matches the modelling problem.

The informal idea

L2 regularisation penalises the sum of squared coefficients. It prefers many small weights and rarely sets any of them exactly to zero. L1 regularisation penalises the sum of absolute coefficients. It can make some weights exactly zero.
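A small numerical check can make the difference concrete. The sketch below uses two made-up coefficient vectors (hypothetical values, not taken from any fitted model) that have the same sum of squares, and compares what each penalty charges for them.

```python
import numpy as np

# Hypothetical coefficient vectors, chosen so both have the same sum of squares.
beta_dense = np.array([0.5, 0.5, 0.5, 0.5])   # weight spread over four features
beta_sparse = np.array([1.0, 0.0, 0.0, 0.0])  # all weight on one feature


def l2_penalty(beta):
    """Sum of squared coefficients (the L2 penalty before scaling by alpha)."""
    return float(np.sum(beta ** 2))


def l1_penalty(beta):
    """Sum of absolute coefficients (the L1 penalty before scaling by alpha)."""
    return float(np.sum(np.abs(beta)))


for name, beta in [("dense", beta_dense), ("sparse", beta_sparse)]:
    print(name, "L2:", l2_penalty(beta), "L1:", l1_penalty(beta))
# dense L2: 1.0 L1: 2.0
# sparse L2: 1.0 L1: 1.0
```

The L2 penalty is indifferent between these two vectors, while the L1 penalty charges the spread-out vector twice as much. That asymmetry is one way to see why an L1-penalised fit tends to concentrate weight on fewer features.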

Dense factor forecast? Start with L2. Sparse scorecard or feature screening? Consider L1. Unsure? Compare both inside Cross-Validation, not on the test set.
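As a rough illustration of comparing both penalties inside cross-validation, the sketch below scores a ridge and a lasso pipeline on training folds only and touches the held-out block once at the end. The synthetic data, the alpha values, the 75/25 split, and the choice of `TimeSeriesSplit` (a time-ordered splitter, assumed here because financial data is usually ordered in time) are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix ordered in time (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 1.2 * X[:, 0] - 0.9 * X[:, 1] + 0.4 * rng.normal(size=300)

# Keep the last 25% as a test block that model selection never sees.
n_train = 225
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

candidates = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=3.0)),
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.08, max_iter=10_000)),
}

# Score each candidate on training folds only; lower mean squared error is better.
cv = TimeSeriesSplit(n_splits=5)
cv_mse = {
    name: -cross_val_score(
        model, X_train, y_train, cv=cv, scoring="neg_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}

# Pick the winner by cross-validation, refit on all training data,
# and only then look at the test block, once.
best = min(cv_mse, key=cv_mse.get)
final = candidates[best].fit(X_train, y_train)
test_r2 = final.score(X_test, y_test)
print(cv_mse, best, round(test_r2, 3))
```

Because the scaler sits inside each pipeline, it is refit on every training fold, so the validation folds do not influence the standardisation.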

Formal statement

For a loss function $L(\boldsymbol{\beta})$ and a penalty strength $\alpha \ge 0$, the two common penalties are

$$\text{L2:}\quad L(\boldsymbol{\beta})+\alpha\sum_{j=1}^p\beta_j^2,$$

and

$$\text{L1:}\quad L(\boldsymbol{\beta})+\alpha\sum_{j=1}^p |\beta_j|.$$

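For a convex loss, each penalised problem is equivalent to minimising the loss subject to a bound on the penalty, so the feasible region is a ball in the corresponding norm. The L2 ball is round, so shrinkage is smooth: coefficients are pulled toward zero but rarely reach it. The L1 ball has corners on the coordinate axes, so optima often land on those corners and edges and produce exact zeros.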
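The geometry can be checked in the simplest possible case. The sketch below assumes a single coefficient with loss $\tfrac{1}{2}(\beta - z)^2$, where $z$ stands for the unpenalised estimate (an assumption made for illustration, not the general regression setting). Under that assumption the L2 solution is $z/(1+2\alpha)$, a proportional shrink, and the L1 solution is the soft-threshold $\operatorname{sign}(z)\max(|z|-\alpha, 0)$. The function names and input values are hypothetical.

```python
import numpy as np


def ridge_shrink(z, alpha):
    """Minimiser of 0.5 * (b - z)**2 + alpha * b**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * alpha)


def lasso_shrink(z, alpha):
    """Minimiser of 0.5 * (b - z)**2 + alpha * |b|: soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)


# Hypothetical unpenalised estimates: two large, three small.
z = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])
alpha = 0.5

ridge_b = ridge_shrink(z, alpha)
lasso_b = lasso_shrink(z, alpha)

print("ridge:", ridge_b)
print("lasso:", lasso_b)
print("exact zeros (ridge, lasso):",
      int(np.sum(ridge_b == 0)), int(np.sum(lasso_b == 0)))
```

In this one-dimensional setting, L2 halves every estimate and zeroes none of them, while L1 subtracts a fixed amount and sets every estimate with magnitude at or below `alpha` exactly to zero.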

Implementation

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

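# Synthetic data: only the first two of the eight features drive y.
rng = np.random.default_rng(41)
X = rng.normal(size=(250, 8))
y = 1.2 * X[:, 0] - 0.9 * X[:, 1] + 0.4 * rng.normal(size=250)

# Standardise first so the penalty treats every coefficient on the same scale.
# scikit-learn scales the two objectives differently (Lasso divides the squared
# error by 2n, Ridge does not), so the two alpha values are not comparable.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=3.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.08, max_iter=10_000)).fit(X, y)

# Coefficients of the final estimator in each pipeline: ridge first, then lasso.
print(np.round(ridge[-1].coef_, 2))
print(np.round(lasso[-1].coef_, 2))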
# [ 1.16 -0.82  0.02 -0.01  0.04 -0.03  0.01 -0.03]
# [ 1.08 -0.75  0.    0.    0.    0.    0.    0.  ]

Key comparison

Question	Prefer L2 / ridge	Prefer L1 / lasso
Many features weakly useful?	Yes	No
Correlated features represent one idea?	Often	Use carefully
Exact feature selection required?	No	Yes
Coefficient stability priority?	Usually	Not always
Compact scorecard needed?	Maybe	Often

Common confusions and pitfalls

"Regularisation fixes leakage." It controls coefficient complexity. Leakage can still produce excellent validation numbers and useless live performance.

"L1 is the interpretable option." It is sparse, but sparsity can be unstable when predictors are correlated.
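One way to probe this instability is to refit on bootstrap resamples and watch which features survive. The sketch below is a toy setup under stated assumptions: two nearly identical synthetic predictors, hypothetical alpha values, and 50 resamples. It records how often each coefficient is exactly zero and how much the coefficients vary across resamples.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
n = 200

# Two predictors carrying almost the same information (assumption for the demo).
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=n)


def bootstrap_coefs(model, n_boot=50):
    """Refit `model` on bootstrap resamples and collect the coefficients."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.array(coefs)


lasso_coefs = bootstrap_coefs(Lasso(alpha=0.1, max_iter=10_000))
ridge_coefs = bootstrap_coefs(Ridge(alpha=3.0))

# Fraction of resamples in which each feature is dropped entirely.
print("lasso zero fraction:", np.mean(lasso_coefs == 0.0, axis=0))
print("ridge zero fraction:", np.mean(ridge_coefs == 0.0, axis=0))

# Spread of each coefficient across resamples.
print("lasso coef std:", lasso_coefs.std(axis=0))
print("ridge coef std:", ridge_coefs.std(axis=0))
```

In a setup like this one would typically expect lasso to move weight between the two near-duplicates from one resample to the next, while ridge splits the weight more evenly. The exact numbers depend on the seed and the alpha values, so treat the output as a diagnostic rather than a result.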

"L2 is inferior because it keeps all features." Dense shrinkage is often the right assumption for factor models.

Where this goes next

Ridge Regression: develops dense L2 shrinkage.
Lasso Regression: develops sparse L1 selection.
Logistic Regression: uses the same penalty choices for classification.
Cross-Validation: selects penalties without contaminating the test set.

References

Aurelien Geron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 4 (Ridge, Lasso, Elastic Net).
Andrew Ng and Tengyu Ma (2023). CS229 Lecture Notes. Ch. 9 (Regularization and Model Selection).
Avrim Blum, John Hopcroft, and Ravindran Kannan (2020). Foundations of Data Science. Ch. 5.5-5.7 (Overfitting and Regularization).