Ridge Regression

Motivation: why this matters in quant finance

Ridge regression is the linear model you reach for when ordinary least squares is too jumpy. Quant features are often correlated: valuation ratios overlap, yield-curve points move together, and technical indicators reuse the same price history. OLS can fit such data well in sample while assigning large, offsetting positive and negative coefficients that change sharply from one sample to the next.

Ridge keeps all features in the model but shrinks their coefficients toward zero. This makes it a natural baseline for dense signals where many predictors may each contain a little information.

The informal idea

OLS only asks for small residuals. Ridge asks for small residuals and moderate coefficients. If two spread features say nearly the same thing, ridge prefers sharing weight across them instead of using large offsetting coefficients.

Formal statement

Ridge regression solves

\hat{\boldsymbol{\beta}}=\arg\min_{\boldsymbol{\beta}} \lVert \mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert_2^2 + \alpha\lVert\boldsymbol{\beta}\rVert_2^2.

With centred features and a centred response, so that the intercept drops out of the problem and stays unpenalised, the closed form is

\hat{\boldsymbol{\beta}}=(\mathbf{X}^\top\mathbf{X}+\alpha\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}.

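The $\alpha\mathbf{I}$ term adds $\alpha$ to every eigenvalue of $\mathbf{X}^\top\mathbf{X}$, which improves conditioning and shrinks the fit most strongly along low-variance directions of the feature space.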
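
The effect on conditioning and on weak directions can be checked numerically. The following is a minimal sketch on synthetic data, not part of the original implementation: it assumes two nearly collinear, centred features and an illustrative value `alpha = 5.0`, and compares the normal-equations solution with the equivalent form obtained from the singular value decomposition, where each direction is scaled by $d_j^2/(d_j^2+\alpha)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

# Two nearly collinear synthetic columns, centred so no intercept is needed.
X = np.c_[base, base + 0.02 * rng.normal(size=n)]
X = X - X.mean(axis=0)
y = 0.8 * base + rng.normal(scale=0.25, size=n)
y = y - y.mean()
alpha = 5.0  # illustrative penalty strength (assumption)

# Closed form via the regularised normal equations.
gram = X.T @ X
beta_direct = np.linalg.solve(gram + alpha * np.eye(2), X.T @ y)

# Same solution via the thin SVD X = U diag(d) V^T.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + alpha)  # per-direction shrinkage factor, between 0 and 1
beta_svd = Vt.T @ (shrink / d * (U.T @ y))

cond_before = np.linalg.cond(gram)
cond_after = np.linalg.cond(gram + alpha * np.eye(2))

print("eigenvalues of X^T X:", np.round(d**2, 3))
print("shrinkage factors:   ", np.round(shrink, 3))
print("condition number before/after:", round(cond_before, 1), round(cond_after, 1))
print("solutions agree:", np.allclose(beta_direct, beta_svd))
```

The strong (shared) direction keeps a shrinkage factor close to 1, while the weak direction that separates the two near-duplicates is shrunk almost to zero. That is why the large offsetting coefficients disappear.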

Implementation

import numpy as np

class RidgeRegression:
    """Ridge regression with centred data and unpenalised intercept."""
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha

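    def fit(self, X: np.ndarray, y: np.ndarray):
        # Centre X and y so the intercept is estimated separately and not penalised.
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc, yc = X - x_mean, y - y_mean
        # Solve (Xc'Xc + alpha * I) beta = Xc'yc instead of forming an explicit inverse.
        self.coef_ = np.linalg.solve(Xc.T @ Xc + self.alpha * np.eye(X.shape[1]), Xc.T @ yc)
        # Recover the intercept on the original (uncentred) scale.
        self.intercept_ = y_mean - x_mean @ self.coef_
        return self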

# Two almost identical features: the second is the first plus a little noise.
rng = np.random.default_rng(13)
base = rng.normal(size=200)
X = np.c_[base, base + 0.02 * rng.normal(size=200)]
y = 0.8 * base + rng.normal(scale=0.25, size=200)
# OLS with an explicit intercept column; keep only the two slope coefficients.
ols = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0][1:]
ridge = RidgeRegression(alpha=5).fit(X, y).coef_
print(np.round(ols, 3), np.round(ridge, 3))
# [ 0.87  -0.061] [0.398 0.398]

Key properties and trade-offs

Property	Ridge behaviour	Quant use
Shrinks, rarely zeros	Coefficients move toward zero but remain active.	Good for dense factor models.
Stabilises collinearity	Adds $\alpha\mathbf{I}$ to the normal equations.	Useful for yield-curve and factor-library regressions.
Scale-sensitive	Units affect penalty strength.	Standardise predictors inside a pipeline.
Bias-variance trade-off	Larger $\alpha$ lowers variance but adds bias.	Select $\alpha$ by validation.
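
The last two rows of the table can be combined into one small workflow. The sketch below is illustrative only: the data are synthetic, the helper `ridge_fit`, the chronological 200/100 split, and the `alphas` grid are all assumptions made for this example. It standardises the predictors with training-set statistics and picks $\alpha$ by hold-out validation error.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Centred ridge fit (hypothetical helper); returns (coef, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

rng = np.random.default_rng(1)
n, p = 300, 10
common = rng.normal(size=(n, 1))
X = common + 0.3 * rng.normal(size=(n, p))  # ten correlated synthetic features
y = X @ np.full(p, 0.1) + rng.normal(scale=1.0, size=n)

# Chronological split: earlier rows train, later rows validate (no shuffling).
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

# Standardise with training statistics only, then reuse them on validation data.
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
Z_tr, Z_va = (X_tr - mu) / sd, (X_va - mu) / sd

alphas = np.logspace(-2, 4, 13)  # candidate penalties (assumed grid)
val_mse = []
for a in alphas:
    coef, intercept = ridge_fit(Z_tr, y_tr, a)
    pred = Z_va @ coef + intercept
    val_mse.append(np.mean((y_va - pred) ** 2))
val_mse = np.array(val_mse)

best_alpha = alphas[val_mse.argmin()]
print("best alpha:", best_alpha, "validation MSE:", round(val_mse.min(), 4))
```

Computing `mu` and `sd` on the training rows only is the design choice that matters here: scaling with full-sample statistics would leak validation information into the fit.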

Worked example: correlated factors

If value and earnings yield have correlation near 0.95, OLS may rotate weight between them as the sample changes. Ridge treats them as a shared direction and spreads weight across them, which is often the better first-pass research assumption.
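
A bootstrap experiment makes the "rotating weight" claim concrete. This is a sketch under stated assumptions: `value` and `earnings_yield` are synthetic series built from a shared latent factor so that their correlation is close to 0.95, the response is simulated, and the ridge penalty of 50 is chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250
latent = rng.normal(size=n)

# Hypothetical factor exposures sharing one latent driver (correlation near 0.95).
value = latent + 0.23 * rng.normal(size=n)
earnings_yield = latent + 0.23 * rng.normal(size=n)
X = np.c_[value, earnings_yield]
y = 0.5 * latent + rng.normal(scale=1.0, size=n)

def fit(X, y, alpha):
    """Centred ridge slopes; alpha = 0 gives the OLS slopes."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

# Refit both models on 500 bootstrap resamples of the rows.
ols_draws, ridge_draws = [], []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    ols_draws.append(fit(X[idx], y[idx], 0.0))
    ridge_draws.append(fit(X[idx], y[idx], 50.0))
ols_draws, ridge_draws = np.array(ols_draws), np.array(ridge_draws)

print("feature correlation:", round(np.corrcoef(value, earnings_yield)[0, 1], 3))
print("OLS   coefficient std:", np.round(ols_draws.std(axis=0), 3))
print("ridge coefficient std:", np.round(ridge_draws.std(axis=0), 3))
```

Across resamples the OLS coefficients vary much more than the ridge coefficients, mostly because OLS is free to move weight from one factor to the other along the poorly identified difference direction.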

Common confusions and pitfalls

"Ridge selects features." Ridge shrinks features; it usually does not set them to zero.

"Bigger alpha is safer." Excessive shrinkage collapses predictions toward the intercept.

"Scaling is cosmetic." Without scaling, units decide which coefficients are cheap to penalise.

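The last two pitfalls are easy to reproduce. The sketch below uses synthetic data and a hypothetical helper `ridge_fitted`, with illustrative penalty values. It rescales one feature by a factor of 100 (as if switching from percent to decimal units), shows that the ridge fit changes unless the predictors are standardised first, and shows that an extreme $\alpha$ flattens the fitted values to the sample mean.

```python
import numpy as np

def ridge_fitted(X, y, alpha):
    """Fit centred ridge on (X, y) and return the in-sample fitted values."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return y_mean + Xc @ coef

def standardise(A):
    """Scale each column to zero mean and unit standard deviation."""
    return (A - A.mean(axis=0)) / A.std(axis=0)

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([0.5, 0.5]) + rng.normal(scale=0.5, size=n)

# Same information, different units: the second column is 100x smaller.
X_rescaled = X * np.array([1.0, 0.01])

fit_raw = ridge_fitted(X, y, alpha=10.0)
fit_rescaled = ridge_fitted(X_rescaled, y, alpha=10.0)
fit_std_a = ridge_fitted(standardise(X), y, alpha=10.0)
fit_std_b = ridge_fitted(standardise(X_rescaled), y, alpha=10.0)

# Extreme penalty: slopes vanish and every fitted value approaches mean(y).
fit_huge = ridge_fitted(standardise(X), y, alpha=1e8)

print("unscaled fits differ by up to:", round(np.abs(fit_raw - fit_rescaled).max(), 3))
print("standardised fits agree:", np.allclose(fit_std_a, fit_std_b))
print("spread of fitted values at huge alpha:", np.ptp(fit_huge))
```

In the rescaled version the second feature needs a coefficient 100 times larger to carry the same signal, so the penalty effectively removes it. Standardising puts both features on the same footing.
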
Where this goes next

Lasso Regression: uses L1 geometry to create exact zeros.
Regularisation: L1 vs L2: compares dense shrinkage and sparse selection.
Cross-Validation: selects $\alpha$ without contaminating the test set.
Matrix Factorisations: explains the numerical stability behind least-squares solvers.

References

Aurelien Geron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 4 (Ridge Regression and regularised linear models).
Andrew Ng and Tengyu Ma (2023). CS229 Lecture Notes. Ch. 9 (Regularization and Model Selection).
Deisenroth, Faisal, and Ong (2020). Mathematics for Machine Learning. Ch. 8 (Model Selection) and Ch. 9 (Linear Regression).