Linear Regression Derivation

Motivation: why this matters in quant finance

Linear regression is the single most-used statistical tool in quantitative finance. Every time you:
  • Compute a stock's beta against a market index,
  • Estimate a factor model's loadings,
  • Build a statistical arbitrage signal from predictors,
  • Calibrate a yield-curve model to observed bond prices,
  • Hedge an option's residuals against linear risk factors,
you are running a linear regression. The math is textbook, but the derivation of the closed-form estimator $\hat\beta = (X^\top X)^{-1}X^\top y$ is worth walking through carefully once: it shows why the estimator takes this form, when it is valid, what "best linear unbiased" means, and where the Gauss-Markov assumptions come in.

The derivation also generalises cleanly: ridge regression replaces $X^\top X$ with $X^\top X + \lambda I$, which shrinks the coefficients; generalised least squares replaces the identity weight matrix with $\Sigma^{-1}$, the inverse noise covariance; principal-component regression projects $X$ onto the top eigenvectors of $X^\top X$. Each of these is a modification of the same $(X^\top X)^{-1}X^\top y$ formula.

The informal idea

We observe $n$ data points $(x_i, y_i)$, where $x_i \in \mathbb{R}^p$ is a feature vector and $y_i \in \mathbb{R}$ is a scalar target. We postulate a linear relationship

$$y_i = x_i^\top \beta + \epsilon_i,$$
with $\beta \in \mathbb{R}^p$ an unknown vector of coefficients and $\epsilon_i$ random noise. The goal: estimate $\beta$ from the data.
The idea of least squares is to choose $\hat\beta$ to minimise the sum of squared residuals:
$$\hat\beta := \arg\min_\beta \sum_{i=1}^n (y_i - x_i^\top\beta)^2.$$
"Squared" error is chosen not because it has any special physical meaning (absolute error or Huber loss are equally reasonable) but because it yields a closed-form solution after a single round of differentiation. Minimising squared loss is also equivalent to maximum likelihood estimation under i.i.d. Gaussian noise, which connects least squares to likelihood-based inference.

Matrix setup

Stack the data:

$$X = \begin{pmatrix}x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top\end{pmatrix} \in \mathbb{R}^{n \times p}, \quad y = \begin{pmatrix}y_1 \\ \vdots \\ y_n\end{pmatrix} \in \mathbb{R}^n, \quad \epsilon = \begin{pmatrix}\epsilon_1 \\ \vdots \\ \epsilon_n\end{pmatrix} \in \mathbb{R}^n.$$

The model is $y = X\beta + \epsilon$. The residual vector is $r(\beta) = y - X\beta$ and the objective is

$$L(\beta) = \|r(\beta)\|_2^2 = (y - X\beta)^\top(y - X\beta).$$

Deriving $\hat\beta$

Expand the objective:

$$L(\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta.$$

(We used that $y^\top X\beta$ is a scalar, hence equal to its transpose $\beta^\top X^\top y$, so the two cross-terms combine into a single term with coefficient $-2$.)

Take the gradient with respect to $\beta$, using $\nabla_\beta(\beta^\top a) = a$ and $\nabla_\beta(\beta^\top A\beta) = 2A\beta$ for symmetric $A$:

$$\nabla_\beta L = -2 X^\top y + 2 X^\top X\beta.$$

Set to zero:

$$X^\top X\beta = X^\top y. \tag{normal equations}$$

If $X^\top X$ is invertible (equivalently, the columns of $X$ are linearly independent, i.e. $X$ has full column rank $p$), solve:

$$\hat\beta = (X^\top X)^{-1}X^\top y.$$
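
As a numerical sanity check of the closed form, here is a minimal NumPy sketch. The data are synthetic and the names (`beta_true`, `beta_normal`, `beta_lstsq`) are illustrative assumptions, not part of the lesson. It solves the normal equations directly and compares the result with NumPy's own least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): n observations, p features.
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])      # hypothetical "true" coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X^T X beta = X^T y.
# np.linalg.solve avoids forming the explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: NumPy's least-squares routine, which minimises ||y - X beta||_2.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient -2 X^T y + 2 X^T X beta should vanish at the solution.
grad_at_solution = -2 * X.T @ y + 2 * X.T @ X @ beta_normal
```

In floating point, solving the linear system (or calling `np.linalg.lstsq`) is generally preferable to computing $(X^\top X)^{-1}$ explicitly, although both give the same answer on well-conditioned data like this.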

Second-order check

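The Hessian is $\nabla^2_\beta L = 2 X^\top X$, which is positive semi-definite (PSD), so $L$ is convex and every stationary point is a global minimum. If $X$ has full column rank, $X^\top X$ is positive definite, so $\hat\beta$ is the unique global minimum. Otherwise the normal equations have infinitely many solutions, all with the same loss, and extra structure such as regularisation is needed to pick one.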
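
The rank-deficient case can be illustrated with a small sketch. It assumes a synthetic design whose third column is the sum of the first two, and a hypothetical ridge penalty `lam`. Two different coefficient vectors attain the same loss, while adding $\lambda I$ makes the system positive definite again.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Third column is an exact linear combination of the first two: rank 2, not 3.
X = np.column_stack([x1, x2, x1 + x2])
y = rng.normal(size=n)

rank = np.linalg.matrix_rank(X)
eigvals = np.linalg.eigvalsh(X.T @ X)       # ascending; smallest is ~0 (PSD, not PD)

def loss(b):
    """Sum of squared residuals for coefficient vector b."""
    return float(np.sum((y - X @ b) ** 2))

# lstsq returns the minimum-norm minimiser when X is rank deficient.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
null_dir = np.array([1.0, 1.0, -1.0])       # X @ null_dir == 0 by construction
beta_other = beta_min_norm + 3.0 * null_dir # a different minimiser, same loss

# Ridge: X^T X + lam * I is positive definite for lam > 0, so the solution is unique.
lam = 1e-2                                  # hypothetical penalty strength
ridge_matrix = X.T @ X + lam * np.eye(3)
beta_ridge = np.linalg.solve(ridge_matrix, X.T @ y)
```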

Geometric interpretation

The fitted values are $\hat y = X\hat\beta = X(X^\top X)^{-1}X^\top y = H y$, where

$$H := X(X^\top X)^{-1}X^\top$$
is the hat matrix (the orthogonal projection onto the column space of $X$). Properties:
  • $H^2 = H$ (idempotent: $H$ is a projection).
  • $H^\top = H$ (symmetric).
  • $\text{rank}(H) = p$ (one dimension per coefficient).
  • $(I - H)X = 0$ (residuals are orthogonal to the features).
Geometric picture. As $\beta$ varies, $X\beta$ traces out the column space of $X$, which is a $p$-dimensional subspace of $\mathbb{R}^n$ when $X$ has full column rank. The least-squares problem is: find the closest point in this subspace to $y$. The answer is the orthogonal projection $\hat y = Hy$. The residual $y - \hat y = (I - H)y$ is orthogonal to the column space, which is exactly the content of the normal equations $X^\top(y - X\hat\beta) = 0$.
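
The projection properties listed above can be checked numerically. The following is a sketch on random synthetic data. The variable names are illustrative. Building $H$ explicitly is done here only for demonstration, since it is an $n \times n$ matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T, built with solve rather than an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y
residual = y - y_hat

idempotent = np.allclose(H @ H, H)                   # H^2 = H
symmetric = np.allclose(H, H.T)                      # H^T = H
trace_H = np.trace(H)                                # equals rank(H) = p for a projection
annihilates_X = np.allclose((np.eye(n) - H) @ X, 0)  # (I - H) X = 0
orthogonal = np.allclose(X.T @ residual, 0)          # normal equations: X^T (y - y_hat) = 0
```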

Statistical properties under Gauss-Markov assumptions

Assume:

  1. Linearity: $y = X\beta + \epsilon$.
  2. $\mathbb{E}[\epsilon] = 0$.
  3. $\text{Cov}(\epsilon) = \sigma^2 I_n$ (homoscedasticity: constant noise variance, uncorrelated across observations).
  4. $X$ is non-random (deterministic features) and has full column rank.

Then:

Unbiasedness

Since $X$ is non-random and $\mathbb{E}[\epsilon] = 0$, we have $\mathbb{E}[y] = X\beta$, so

$$\mathbb{E}[\hat\beta] = (X^\top X)^{-1}X^\top \mathbb{E}[y] = (X^\top X)^{-1}X^\top X\beta = \beta.$$

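Covariance

Using $\text{Cov}(Ay) = A\,\text{Cov}(y)A^\top$ with $A = (X^\top X)^{-1}X^\top$ and $\text{Cov}(y) = \sigma^2 I_n$:

$$\text{Cov}(\hat\beta) = (X^\top X)^{-1}X^\top\text{Cov}(y)X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}.$$

So the standard error of each coefficient is $\text{SE}(\hat\beta_j) = \sigma\sqrt{[(X^\top X)^{-1}]_{jj}}$. In practice $\sigma$ is unknown and is replaced by the estimate $\hat\sigma^2 = \|y - X\hat\beta\|^2/(n - p)$.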
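
A small Monte Carlo sketch can make the unbiasedness and covariance formulas concrete. It assumes a fixed synthetic design, Gaussian noise, and hypothetical values for `beta_true` and `sigma`. The last lines use the usual residual-based estimate of $\sigma^2$, namely the residual sum of squares divided by $n - p$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 100, 2, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design, reused in every draw
beta_true = np.array([1.0, 2.0])                       # hypothetical coefficients

XtX_inv = np.linalg.inv(X.T @ X)
cov_theory = sigma**2 * XtX_inv
se_theory = np.sqrt(np.diag(cov_theory))

# Monte Carlo: redraw the noise many times while keeping X fixed.
n_sims = 20000
noise = sigma * rng.normal(size=(n_sims, n))
Y = X @ beta_true + noise                 # shape (n_sims, n), one sample per row
betas = Y @ X @ XtX_inv                   # row k is (X^T X)^{-1} X^T y_k (XtX_inv is symmetric)

beta_mean = betas.mean(axis=0)            # should be close to beta_true
cov_empirical = np.cov(betas, rowvar=False)

# With sigma unknown, estimate it from the residuals of a single sample.
y = Y[0]
beta_hat = betas[0]
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / (n - p)
se_hat = np.sqrt(sigma2_hat * np.diag(XtX_inv))
```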

BLUE — Gauss-Markov theorem

Under assumptions 1–4, $\hat\beta$ is the Best Linear Unbiased Estimator (BLUE): it has the smallest variance (in the PSD ordering) among all linear unbiased estimators of $\beta$. "Best" here means $\text{Cov}(\hat\beta) \preceq \text{Cov}(\tilde\beta)$ for any other linear unbiased estimator $\tilde\beta$, i.e. $\text{Cov}(\tilde\beta) - \text{Cov}(\hat\beta)$ is PSD. The proof is a short algebraic argument; the intuition is that $\hat\beta$ uses the data as efficiently as possible given the assumptions.
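
One way to fill in the algebra is sketched here. This is the standard argument; the matrices $C$ and $D$ are notation introduced for this sketch only. Write any linear estimator as $\tilde\beta = Cy$ and split $C$ into the OLS matrix plus a remainder $D$:

```latex
\begin{aligned}
\tilde\beta &= Cy, \qquad C = (X^\top X)^{-1}X^\top + D, \\
\mathbb{E}[\tilde\beta] &= CX\beta = \beta + DX\beta
  \;\Rightarrow\; DX = 0 \quad \text{(unbiasedness must hold for every } \beta\text{)}, \\
\text{Cov}(\tilde\beta) &= \sigma^2 CC^\top
  = \sigma^2\big[(X^\top X)^{-1} + (X^\top X)^{-1}(DX)^\top + DX(X^\top X)^{-1} + DD^\top\big] \\
 &= \sigma^2 (X^\top X)^{-1} + \sigma^2 DD^\top
  = \text{Cov}(\hat\beta) + \sigma^2 DD^\top .
\end{aligned}
```

Since $DD^\top$ is PSD, $\text{Cov}(\tilde\beta) - \text{Cov}(\hat\beta)$ is PSD, with equality only when $D = 0$, that is, when $\tilde\beta$ is the OLS estimator itself.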

Canonical applications

Application 1 — Stock beta

Regress stock returns $r_t$ on market returns $r_{m,t}$:

$$r_t = \alpha + \beta r_{m,t} + \epsilon_t.$$

With $X = \begin{pmatrix}\mathbf{1} & r_m\end{pmatrix}$, the slope coefficient is

$$\hat\beta = \frac{\text{Cov}(r_t, r_{m,t})}{\text{Var}(r_{m,t})},$$

where Cov and Var here are the sample covariance and sample variance.

The stock's beta is a covariance-ratio — the classic CAPM formula, derived directly from least squares. This is the basis of the Single Index Model and the foundation of factor risk models.
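
The equivalence between the regression slope and the covariance ratio can be checked on simulated returns. This is a sketch with synthetic data (not market data); `alpha_true`, `beta_true` and the volatility levels are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 500
# Synthetic daily returns (illustrative, not market data).
r_m = 0.01 * rng.normal(size=T)
alpha_true, beta_true = 0.0002, 1.3         # hypothetical values
r = alpha_true + beta_true * r_m + 0.005 * rng.normal(size=T)

# OLS with an intercept column.
X = np.column_stack([np.ones(T), r_m])
alpha_hat, beta_hat = np.linalg.lstsq(X, r, rcond=None)[0]

# Covariance-ratio form; ddof must match in numerator and denominator.
beta_cov = np.cov(r, r_m, ddof=1)[0, 1] / np.var(r_m, ddof=1)
alpha_cov = r.mean() - beta_cov * r_m.mean()
```

The two routes agree to floating-point precision; the intercept follows from the fact that the fitted line passes through the sample means.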

Application 2 — Fama-French three-factor model

Regress excess stock returns on three factors: market excess return, SMB (size), and HML (value):

$$r_{i,t} - r_f = \alpha_i + \beta_{i,M}(r_{m,t} - r_f) + \beta_{i,\text{SMB}}\,\text{SMB}_t + \beta_{i,\text{HML}}\,\text{HML}_t + \epsilon_{i,t}.$$

The same formula $\hat\beta = (X^\top X)^{-1}X^\top y$ applies, with $X$ holding a column of ones and the three factor series. The loadings $\hat\beta$ are the stock's factor exposures.

Application 3 — Yield curve fitting

Fit a Nelson-Siegel or Svensson parametric model to observed zero-coupon yields. Once the decay parameters are fixed, the model is linear in its remaining coefficients (level, slope and curvature for Nelson-Siegel), so least squares gives the closed-form fit.
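
A minimal sketch of the Nelson-Siegel case follows. It assumes the common three-factor parameterisation $y(\tau) = \beta_0 + \beta_1 \frac{1 - e^{-\lambda\tau}}{\lambda\tau} + \beta_2\left(\frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}\right)$ with the decay rate $\lambda$ held fixed. The maturities, the value of `lam`, the coefficients and the yields are all synthetic, and the helper `nelson_siegel_design` is a hypothetical name.

```python
import numpy as np

def nelson_siegel_design(tau, lam):
    """Design matrix with columns [level, slope, curvature] at maturities tau (years)."""
    x = lam * tau
    slope = (1.0 - np.exp(-x)) / x
    curvature = slope - np.exp(-x)
    return np.column_stack([np.ones_like(tau), slope, curvature])

tau = np.array([0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0, 20.0, 30.0])
lam = 0.6                                   # hypothetical fixed decay parameter (per year)
coef_true = np.array([0.04, -0.02, 0.01])   # hypothetical level, slope, curvature

rng = np.random.default_rng(5)
X = nelson_siegel_design(tau, lam)
yields = X @ coef_true + 1e-4 * rng.normal(size=tau.size)  # synthetic yields, ~1bp noise

# With lam fixed the fit is ordinary least squares on the three columns.
coef_hat, *_ = np.linalg.lstsq(X, yields, rcond=None)
fitted = X @ coef_hat
rmse_bp = 1e4 * np.sqrt(np.mean((yields - fitted) ** 2))   # fit error in basis points
```

If $\lambda$ is also to be estimated, the problem is no longer linear; one simple approach is to repeat this linear fit over a grid of candidate $\lambda$ values and keep the best one.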

Common pitfalls

"$(X^\top X)^{-1}$ always exists." No. If $X$ has multicollinearity (columns linearly dependent or nearly so), $X^\top X$ is singular or ill-conditioned. Cures: drop redundant features, apply regularisation (ridge), or use PCA.
"Least squares is unbiased under any noise distribution." Nearly, but unbiasedness is not optimality. Unbiasedness needs only $\mathbb{E}[\epsilon] = 0$. BLUE-ness (optimality) needs the Gauss-Markov conditions, especially constant-variance, uncorrelated noise. Heteroscedastic noise (e.g. GARCH returns) breaks BLUE: OLS stays unbiased but is no longer minimum-variance, and the covariance formula $\sigma^2(X^\top X)^{-1}$ no longer holds. Weighted least squares is the cure.
"OLS assumes Gaussian noise." No. OLS coincides with the MLE if the noise is Gaussian, but OLS itself doesn't require Gaussianity. The Gauss-Markov theorem requires only the second-moment assumptions above.
"High $R^2$ means a good model." $R^2 = 1 - \|y - \hat y\|^2/\|y - \bar y\|^2$ measures in-sample fit. A model can overfit and have high $R^2$ while being useless out-of-sample. Time-series regressions are especially prone to spurious fits; always hold out validation data.
"$\hat\beta_j$ is significant iff $|t_j| > 2$." Significance depends on the distribution of the t-statistic under the null $\beta_j = 0$. Under the Gauss-Markov assumptions plus Gaussian errors, and with $\sigma$ estimated from the residuals, $t_j = \hat\beta_j / \text{SE}(\hat\beta_j) \sim t_{n-p}$. For $n \gg p$, $t_{n-p} \approx \mathcal{N}(0, 1)$ and $|t| > 1.96$ gives two-sided 5% significance. Watch out for small-sample corrections and autocorrelation in residuals (use Newey-West standard errors for time series).
