Regularisation: L1 vs L2
Motivation: why this matters in quant finance
Regularisation is how a model pays rent for complexity. In finance, where signals are weak and features are correlated, an unpenalised model can fit historical noise with impressive in-sample statistics. L1 and L2 penalties are the two basic controls.
This is a comparison note, not a third derivation of Ridge Regression or Lasso Regression. Ridge explains dense L2 shrinkage. Lasso explains sparse L1 selection. Here the goal is to decide which geometry matches the modelling problem.
The informal idea
L2 regularisation penalises the sum of squared coefficients. It prefers many small weights. L1 regularisation penalises the sum of absolute coefficients. It can make some weights exactly zero.
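To make the preference concrete, here is a minimal sketch comparing the two penalties on two hypothetical weight vectors. The setup is an assumption made for illustration: four features that are identical copies of each other, so only the sum of the weights affects the prediction. One vector concentrates the weight on a single feature, the other spreads it evenly.

```python
import numpy as np

# Hypothetical setup: four identical features, so only the sum of weights
# matters for the prediction. Both vectors below sum to 1.0.
w_concentrated = np.array([1.0, 0.0, 0.0, 0.0])
w_spread = np.array([0.25, 0.25, 0.25, 0.25])

def l2_penalty(w):
    """Sum of squared coefficients."""
    return float(np.sum(w ** 2))

def l1_penalty(w):
    """Sum of absolute coefficients."""
    return float(np.sum(np.abs(w)))

for name, w in [("concentrated", w_concentrated), ("spread", w_spread)]:
    print(f"{name:>12}: L2 = {l2_penalty(w):.2f}, L1 = {l1_penalty(w):.2f}")
# concentrated: L2 = 1.00, L1 = 1.00
#       spread: L2 = 0.25, L1 = 1.00
```

L2 charges the spread vector a quarter of what it charges the concentrated one, so it pushes towards many small weights. L1 charges both the same, so under this toy setup nothing stops the fit from placing all the weight on one feature and leaving the rest at zero.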
Dense factor forecast? Start with L2. Sparse scorecard or feature screening? Consider L1. Unsure? Compare both inside Cross-Validation, not on the test set.
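One way to run that comparison is sketched below. Everything specific here is an assumption for illustration: the synthetic data, the alpha grids, the 80/20 chronological split, and the use of `TimeSeriesSplit` so that each validation block comes after its training block. Both penalties are tuned inside cross-validation on the training portion, and only the winner is scored on the held-out test set.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 1.2 * X[:, 0] - 0.9 * X[:, 1] + 0.4 * rng.normal(size=300)

# Hold out the last 20% as a test set that model selection never sees.
split = 240
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Illustrative alpha grids (assumption); real grids depend on the data scale.
candidates = {
    "ridge": (make_pipeline(StandardScaler(), Ridge()),
              {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}),
    "lasso": (make_pipeline(StandardScaler(), Lasso(max_iter=10_000)),
              {"lasso__alpha": [0.001, 0.01, 0.1, 1.0]}),
}

# Forward-chaining folds keep each validation block after its training block.
cv = TimeSeriesSplit(n_splits=5)

results = {}
for name, (pipeline, grid) in candidates.items():
    search = GridSearchCV(pipeline, grid, cv=cv, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    results[name] = search
    print(name, search.best_params_, round(-search.best_score_, 4))

# Only the single winner touches the test set, and only once.
winner = max(results, key=lambda k: results[k].best_score_)
test_mse = np.mean((results[winner].predict(X_test) - y_test) ** 2)
print("winner:", winner, "test MSE:", round(float(test_mse), 4))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the validation folds do not leak into the standardisation step.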
Formal statement
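For a loss function $L(w)$ over coefficients $w \in \mathbb{R}^p$ and a penalty strength $\lambda \ge 0$, the two common penalties are

$$L(w) + \lambda \lVert w \rVert_2^2 = L(w) + \lambda \sum_{j=1}^{p} w_j^2$$

and

$$L(w) + \lambda \lVert w \rVert_1 = L(w) + \lambda \sum_{j=1}^{p} \lvert w_j \rvert$$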
The L2 ball is round, so shrinkage is smooth. The L1 ball has corners, so optima often land on axes and produce zeros.
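The smooth-versus-cornered contrast is easiest to see in one dimension. As a sketch under stated assumptions: for the scalar problems $\min_w \tfrac{1}{2}(w - z)^2 + \lambda w^2$ and $\min_w \tfrac{1}{2}(w - z)^2 + \lambda \lvert w \rvert$, the minimisers are $z / (1 + 2\lambda)$ and the soft-threshold $\operatorname{sign}(z)\max(\lvert z \rvert - \lambda, 0)$. The factor of $\tfrac{1}{2}$ on the loss, the symbol $z$ for the unpenalised estimate, and the example values are choices made here for illustration; other conventions rescale $\lambda$.

```python
import numpy as np

def ridge_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)

def lasso_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical unpenalised estimates, from strong to negligible.
z = np.array([2.0, 0.5, -0.05, -1.0])
lam = 0.1

print(ridge_shrink(z, lam))
print(lasso_shrink(z, lam))
```

The ridge map divides every entry by 1.2 and leaves none at zero. The lasso map moves each entry towards zero by 0.1 and sets the entry of size 0.05 to exactly zero, which is the one-dimensional version of an optimum landing on an axis.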
Implementation
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, only the first two drive the target.
rng = np.random.default_rng(41)
X = rng.normal(size=(250, 8))
y = 1.2 * X[:, 0] - 0.9 * X[:, 1] + 0.4 * rng.normal(size=250)

# Standardise inside the pipeline so the penalty treats every feature on the same scale.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=3.0)).fit(X, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.08, max_iter=10_000)).fit(X, y)

print(np.round(ridge[-1].coef_, 2))
print(np.round(lasso[-1].coef_, 2))
# [ 1.16 -0.82 0.02 -0.01 0.04 -0.03 0.01 -0.03]
# [ 1.08 -0.75 0. 0. 0. 0. 0. 0. ]
```
Key comparison
| Question | Prefer L2 / ridge | Prefer L1 / lasso |
|---|---|---|
| Many features weakly useful? | Yes | No |
| Correlated features represent one idea? | Often | Use carefully |
| Exact feature selection required? | No | Yes |
| Coefficient stability priority? | Usually | Not always |
| Compact scorecard needed? | Maybe | Often |
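The correlated-features and stability rows can be probed with a small resampling experiment. The sketch below is illustrative only: the synthetic data, the near-duplicate feature pair, the alpha values, and the number of bootstrap draws are all assumptions chosen for the demonstration. Across bootstrap resamples it counts how often lasso keeps each of two near-identical predictors, and how far apart ridge places their coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 300

# Two predictors that carry almost the same information (assumption).
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=n)

n_boot = 200
lasso_selected = np.zeros(2, dtype=int)  # times each feature gets a nonzero weight
ridge_gaps = []                          # |coef_1 - coef_2| under ridge

for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)     # bootstrap resample with replacement
    Xb, yb = X[idx], y[idx]

    lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10_000)).fit(Xb, yb)
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(Xb, yb)

    lasso_selected += (lasso[-1].coef_ != 0).astype(int)
    ridge_gaps.append(abs(ridge[-1].coef_[0] - ridge[-1].coef_[1]))

print("lasso selection frequency:", lasso_selected / n_boot)
print("mean ridge coefficient gap:", round(float(np.mean(ridge_gaps)), 3))
```

If either lasso selection frequency comes out well below 1, the selected set depends on which resample was drawn, which is the reason for "Use carefully" and "Not always" in the L1 column. A small ridge gap indicates that the weight is being shared between the pair rather than assigned to one of them.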
Common confusions and pitfalls
"Regularisation fixes leakage." It controls coefficient complexity. Leakage can still produce excellent validation numbers and useless live performance.
"L1 is the interpretable option." It is sparse, but sparsity can be unstable when predictors are correlated.
"L2 is inferior because it keeps all features." Dense shrinkage is often the right assumption for factor models.
Where this goes next
- Ridge Regression: develops dense L2 shrinkage.
- Lasso Regression: develops sparse L1 selection.
- Logistic Regression: uses the same penalty choices for classification.
- Cross-Validation: selects penalties without contaminating the test set.