Linear regression is the single most-used statistical tool in quantitative finance. Every time you:
Compute a stock's beta against a market index,
Estimate a factor model's loadings,
Build a statistical arbitrage signal from predictors,
Calibrate a yield-curve model to observed bond prices,
Hedge an option's residuals against linear risk factors,
you are running a linear regression. The math is textbook, but the derivation of the closed-form estimator $\hat\beta = (X^\top X)^{-1} X^\top y$ is worth walking through carefully once: it makes clear why the estimator takes this form, when it is valid, what "best linear unbiased" means, and where the Gauss-Markov assumptions come in.
The derivation also generalises cleanly: ridge regression shrinks the coefficients by replacing $X^\top X$ with $X^\top X + \lambda I$; generalised least squares replaces the identity weight with $\Sigma^{-1}$; principal-component regression projects $X$ onto the top eigenvectors of $X^\top X$. Every one of these is a perturbation of the same $(X^\top X)^{-1} X^\top y$ formula.
The informal idea
We observe $n$ data points $(x_i, y_i)$, where $x_i \in \mathbb{R}^p$ is a feature vector and $y_i \in \mathbb{R}$ is a scalar target. We postulate a linear relationship
$$y_i = x_i^\top \beta + \epsilon_i,$$
with $\beta \in \mathbb{R}^p$ an unknown vector of coefficients and $\epsilon_i$ random noise. The goal is to estimate $\beta$ from the data.
The idea of least squares is to choose β^ to minimise the sum of squared residuals:
$$\hat\beta := \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2.$$
"Squared" error is chosen not because it has any special physical meaning (absolute error or Huber loss are equally reasonable) but because it yields a closed-form solution after a single round of differentiation. Squared loss is also the MLE under gaussian noise, connecting least squares to maximum likelihood estimation.
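Stack the observations into matrix form: let $X \in \mathbb{R}^{n \times p}$ be the matrix whose $i$-th row is $x_i^\top$, and let $y, \epsilon \in \mathbb{R}^n$ collect the targets and the noise terms. The model is then $y = X\beta + \epsilon$. The residual vector is $r(\beta) = y - X\beta$ and the objective is
$$L(\beta) = \|r(\beta)\|_2^2 = (y - X\beta)^\top (y - X\beta).$$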
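The link between this objective and Gaussian maximum likelihood can be made explicit. As a short worked step, assuming the noise terms are i.i.d. $N(0, \sigma^2)$ (an assumption stronger than anything the least-squares objective itself needs), the log-likelihood is:

```latex
\begin{aligned}
\log p(y \mid X, \beta, \sigma^2)
  &= \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2} \right) \right] \\
  &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\,\|y - X\beta\|_2^2 .
\end{aligned}
```

Only the last term depends on $\beta$, and it enters with a negative sign, so maximising the likelihood over $\beta$ is the same as minimising the sum of squared residuals.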
Deriving β^
Expand the objective:
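$$L(\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta.$$
(Here we used that $y^\top X \beta$ is a scalar, hence equal to its transpose $\beta^\top X^\top y$, so the two cross-terms combine into a single $-2\beta^\top X^\top y$.)
Take the gradient with respect to $\beta$:
$$\nabla_\beta L = -2X^\top y + 2X^\top X \beta.$$
Set it to zero:
$$X^\top X \beta = X^\top y. \qquad \text{(normal equations)}$$
If $X^\top X$ is invertible (equivalently, the columns of $X$ are linearly independent, i.e. $X$ has full column rank $p$), solve:
$$\hat\beta = (X^\top X)^{-1} X^\top y.$$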
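The closed form can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the sample size, the coefficients in `beta_true` and the noise level are illustrative assumptions, and the helper name `ols_normal_equations` is hypothetical.

```python
import numpy as np

def ols_normal_equations(X, y):
    """Solve the normal equations X^T X beta = X^T y for beta."""
    # np.linalg.solve avoids forming the explicit inverse of X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (assumed for illustration): n = 200 points, p = 3 features.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = ols_normal_equations(X, y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient -2 X^T y + 2 X^T X beta should vanish at beta_hat.
gradient = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

print("beta_hat:", beta_hat)
print("matches lstsq:", np.allclose(beta_hat, beta_lstsq))
print("max |gradient|:", np.max(np.abs(gradient)))
```

Solving the linear system directly is usually preferable to computing $(X^\top X)^{-1}$ explicitly; for badly conditioned problems a QR- or SVD-based routine such as `np.linalg.lstsq` is the more robust choice.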
Second-order check
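The Hessian is $\nabla_\beta^2 L = 2X^\top X$, which is positive semi-definite (PSD). If $X$ has full column rank, $X^\top X$ is positive definite, so $\hat\beta$ is the unique global minimum. Otherwise the minimiser is not unique: adding any vector in the null space of $X$ to a solution leaves the fit unchanged, and a further criterion (such as regularisation) is needed to pick one.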
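The two cases can be seen in a small experiment. This sketch uses synthetic features (an assumption for illustration) and builds a rank-deficient design by appending a column that is the sum of the other two.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

X_full = np.column_stack([x1, x2])                # independent columns
X_deficient = np.column_stack([x1, x2, x1 + x2])  # third column is redundant

def hessian_eigenvalues(X):
    """Eigenvalues of the Hessian 2 X^T X, in ascending order."""
    # The Hessian is symmetric, so eigvalsh applies.
    return np.linalg.eigvalsh(2 * X.T @ X)

eig_full = hessian_eigenvalues(X_full)
eig_deficient = hessian_eigenvalues(X_deficient)
print("full rank:     ", eig_full)
print("rank deficient:", eig_deficient)

# In the deficient case, (1, 1, -1) lies in the null space of X,
# so shifting a solution along it does not change the loss.
y = rng.normal(size=n)
beta_a, *_ = np.linalg.lstsq(X_deficient, y, rcond=None)
beta_b = beta_a + np.array([1.0, 1.0, -1.0])
loss_a = np.sum((y - X_deficient @ beta_a) ** 2)
loss_b = np.sum((y - X_deficient @ beta_b) ** 2)
print("losses:", loss_a, loss_b)
```

With full column rank every eigenvalue is strictly positive; with the redundant column the smallest eigenvalue is zero up to rounding, and two different coefficient vectors attain the same loss.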
Geometric interpretation
The fitted values are $\hat y = X\hat\beta = X(X^\top X)^{-1}X^\top y = Hy$, where
$$H := X(X^\top X)^{-1}X^\top$$
is the hat matrix (the orthogonal projection onto the column space of $X$). Properties:
$H^2 = H$ (idempotent: $H$ is a projection).
$H^\top = H$ (symmetric).
$\operatorname{rank}(H) = p$ (one dimension per coefficient).
$(I - H)X = 0$ (residuals are orthogonal to the features).
Geometric picture. As $\beta$ ranges over $\mathbb{R}^p$, $X\beta$ traces out the column space of $X$, a $p$-dimensional subspace of $\mathbb{R}^n$. The least-squares problem is to find the point in this subspace closest to $y$. The answer is the orthogonal projection $\hat y = Hy$. The residual $y - \hat y = (I - H)y$ is orthogonal to the column space, which is exactly the content of the normal equations $X^\top (y - X\hat\beta) = 0$.
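The projection properties listed above are easy to verify numerically. A minimal sketch, assuming a random full-rank design (the dimensions are arbitrary); for a projection matrix the rank equals the trace, which is what the code checks.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p))   # synthetic design, full column rank almost surely
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T, built with a linear solve.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
residual = y - y_hat

checks = {
    "idempotent": np.allclose(H @ H, H),
    "symmetric": np.allclose(H, H.T),
    "trace_equals_p": np.isclose(np.trace(H), p),
    "annihilates_X": np.allclose((np.eye(n) - H) @ X, 0.0),
    "normal_equations": np.allclose(X.T @ residual, 0.0),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Forming the full $n \times n$ matrix $H$ is fine for a demonstration like this, but for large $n$ one would normally compute $\hat y = X\hat\beta$ directly.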
Statistical properties under Gauss-Markov assumptions
Assume:
1. Linearity: $y = X\beta + \epsilon$.
2. Zero-mean noise: $E[\epsilon] = 0$.
3. $\operatorname{Cov}(\epsilon) = \sigma^2 I_n$ (homoscedasticity: constant noise variance, uncorrelated across observations).
4. $X$ is non-random (deterministic features) and has full column rank.
Then:
Unbiasedness
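Since $E[y] = X\beta + E[\epsilon] = X\beta$ and $X$ is non-random,
$$E[\hat\beta] = (X^\top X)^{-1}X^\top E[y] = (X^\top X)^{-1}X^\top X\beta = \beta.$$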
Covariance
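Since $\operatorname{Cov}(y) = \operatorname{Cov}(\epsilon) = \sigma^2 I_n$,
$$\operatorname{Cov}(\hat\beta) = (X^\top X)^{-1}X^\top \operatorname{Cov}(y)\, X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}.$$
So the standard error of each coefficient is $\operatorname{SE}(\hat\beta_j) = \sigma\sqrt{[(X^\top X)^{-1}]_{jj}}$. In practice $\sigma$ is unknown and is replaced by an estimate computed from the residuals.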
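Both results can be checked by simulation. The sketch below holds a design matrix fixed, redraws the noise many times, and compares the empirical mean and covariance of the estimates with $\beta$ and $\sigma^2 (X^\top X)^{-1}$. The design, `beta_true`, `sigma` and the number of simulations are illustrative assumptions, and Gaussian noise is used only for convenience.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 100, 0.5
# Fixed (non-random) design: an intercept column and one feature.
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.3, 1.2])

XtX_inv = np.linalg.inv(X.T @ X)
theory_cov = sigma**2 * XtX_inv
theory_se = np.sqrt(np.diag(theory_cov))

# Redraw the noise n_sims times; each row of Y is one simulated sample y.
n_sims = 20_000
eps = sigma * rng.normal(size=(n_sims, n))
Y = X @ beta_true + eps

# Row-wise OLS: beta_hat^T = y^T X (X^T X)^{-1}, using symmetry of XtX_inv.
estimates = Y @ X @ XtX_inv

empirical_mean = estimates.mean(axis=0)
empirical_cov = np.cov(estimates, rowvar=False)

print("true beta:     ", beta_true)
print("empirical mean:", empirical_mean)
print("theory SE:     ", theory_se)
print("empirical SE:  ", np.sqrt(np.diag(empirical_cov)))
```

The agreement is only up to Monte Carlo error, which shrinks as `n_sims` grows.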
BLUE — Gauss-Markov theorem
Under assumptions 1–4, $\hat\beta$ is the Best Linear Unbiased Estimator (BLUE): it has the smallest variance (in the PSD ordering) among all linear unbiased estimators of $\beta$. "Best" here means $\operatorname{Cov}(\hat\beta) \preceq \operatorname{Cov}(\tilde\beta)$ for any other linear unbiased estimator $\tilde\beta$. The proof is a short algebraic argument; the intuition is that $\hat\beta$ uses the data as efficiently as possible given the assumptions.
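For completeness, here is a sketch of the standard algebraic argument. The matrix $D$ is notation introduced only for this sketch: it measures how a competing linear estimator $\tilde\beta = Cy$ differs from ordinary least squares.

```latex
\begin{aligned}
\tilde\beta &= Cy, \qquad C = (X^\top X)^{-1}X^\top + D
  && \text{(any linear estimator, for some } p \times n \text{ matrix } D\text{)} \\
E[\tilde\beta] &= CX\beta = \beta + DX\beta
  && \Rightarrow\ DX = 0 \text{ is required for unbiasedness at every } \beta \\
\operatorname{Cov}(\tilde\beta) &= \sigma^2 CC^\top
  = \sigma^2\left[(X^\top X)^{-1} + DD^\top\right]
  && \text{(cross terms vanish because } DX = 0\text{)} \\
\operatorname{Cov}(\tilde\beta) - \operatorname{Cov}(\hat\beta) &= \sigma^2 DD^\top \succeq 0 .
\end{aligned}
```

Since $DD^\top$ is always PSD, no linear unbiased estimator can have a smaller covariance than $\hat\beta$, and equality holds only when $D = 0$.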
Canonical applications
Application 1 — Stock beta
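Regress stock returns $r_t$ on market returns $r_{m,t}$:
$$r_t = \alpha + \beta\, r_{m,t} + \epsilon_t.$$
With design matrix $X = [\mathbf{1} \;\; r_m]$ (a column of ones next to the column of market returns), the slope component of the least-squares solution is
$$\hat\beta = \frac{\widehat{\operatorname{Cov}}(r_t, r_{m,t})}{\widehat{\operatorname{Var}}(r_{m,t})},$$
where the covariance and variance are sample moments.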
The stock's beta is a covariance-ratio — the classic CAPM formula, derived directly from least squares. This is the basis of the Single Index Model and the foundation of factor risk models.
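The equivalence between the regression slope and the covariance ratio can be confirmed on simulated returns. In this sketch the return series are synthetic, and the true alpha, beta and volatilities are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 500
r_m = 0.01 * rng.normal(size=T)                       # synthetic market returns
r = 0.0002 + 1.3 * r_m + 0.005 * rng.normal(size=T)   # synthetic stock returns

# Regression with an intercept: columns are [1, r_m].
X = np.column_stack([np.ones(T), r_m])
alpha_hat, beta_hat = np.linalg.lstsq(X, r, rcond=None)[0]

# Covariance-ratio formula (same ddof in numerator and denominator).
beta_ratio = np.cov(r, r_m)[0, 1] / np.var(r_m, ddof=1)
# The intercept follows from the sample means.
alpha_from_means = r.mean() - beta_ratio * r_m.mean()

print("beta (regression):", beta_hat)
print("beta (cov / var): ", beta_ratio)
print("alpha (regression):", alpha_hat)
print("alpha (from means):", alpha_from_means)
```

The two beta values agree to floating-point precision because the covariance ratio is just the normal equations written out for a two-column design.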
Application 2 — Fama-French three-factor model
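Regress excess stock returns on three factors: the market excess return, SMB (size) and HML (value). Writing $r_{f,t}$ for the risk-free rate,
$$r_t - r_{f,t} = \alpha + \beta_M (r_{m,t} - r_{f,t}) + \beta_S\, \mathrm{SMB}_t + \beta_H\, \mathrm{HML}_t + \epsilon_t.$$
The same formula $\hat\beta = (X^\top X)^{-1}X^\top y$ applies, with one column of $X$ per factor plus a column of ones for $\alpha$. The loadings $\hat\beta$ are the stock's factor exposures.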
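A multi-factor regression is mechanically the same computation with more columns. The sketch below uses randomly generated stand-ins for the three factor series, not real Fama-French data, and the "true" loadings are hypothetical values used to check that the regression recovers them.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 600
# Synthetic stand-ins for the three factor return series (not real data).
mkt = 0.010 * rng.normal(size=T)   # market excess return
smb = 0.006 * rng.normal(size=T)   # size factor
hml = 0.006 * rng.normal(size=T)   # value factor
true_loadings = np.array([1.1, 0.4, -0.3])   # hypothetical exposures

factors = np.column_stack([mkt, smb, hml])
excess_ret = 0.0005 + factors @ true_loadings + 0.004 * rng.normal(size=T)

# Design matrix: a column of ones for alpha, then one column per factor.
X = np.column_stack([np.ones(T), factors])
coef = np.linalg.solve(X.T @ X, X.T @ excess_ret)
alpha_hat, loadings_hat = coef[0], coef[1:]

print("alpha:", alpha_hat)
for name, b in zip(["MKT", "SMB", "HML"], loadings_hat):
    print(f"{name} loading: {b:.3f}")
```

With real data the factor columns would come from a published factor dataset, and the stock returns would need the risk-free rate subtracted before running the regression.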
Application 3 — Yield curve fitting
Fit a Nelson-Siegel or Svensson parametric model to observed zero-coupon yields. Once the decay parameters are fixed, the model is linear in the remaining coefficients (level, slope and curvature in the Nelson-Siegel case), so least squares gives a closed-form fit.
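As a sketch of the Nelson-Siegel case, the code below uses the common parameterisation $y(\tau) = \beta_0 + \beta_1 \frac{1 - e^{-\lambda\tau}}{\lambda\tau} + \beta_2 \left( \frac{1 - e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau} \right)$ with the decay $\lambda$ held fixed. The maturities, the decay value, the coefficients and the yields are all synthetic assumptions, and `nelson_siegel_design` is a hypothetical helper name.

```python
import numpy as np

def nelson_siegel_design(maturities, decay):
    """Design matrix of level, slope and curvature loadings for a fixed decay."""
    x = decay * maturities
    slope = (1.0 - np.exp(-x)) / x
    curvature = slope - np.exp(-x)
    return np.column_stack([np.ones_like(maturities), slope, curvature])

# Maturities in years (illustrative grid).
maturities = np.array([0.25, 0.5, 1, 2, 3, 5, 7, 10, 20, 30], dtype=float)
decay = 0.6                                   # assumed fixed, not estimated here
beta_true = np.array([0.045, -0.02, 0.01])    # hypothetical level/slope/curvature

# Synthetic "observed" yields: model curve plus a small amount of noise.
rng = np.random.default_rng(6)
X = nelson_siegel_design(maturities, decay)
yields = X @ beta_true + 0.0002 * rng.normal(size=maturities.size)

beta_hat, *_ = np.linalg.lstsq(X, yields, rcond=None)
fitted = X @ beta_hat
rmse = np.sqrt(np.mean((yields - fitted) ** 2))

print("estimated coefficients:", beta_hat)
print("fit RMSE:", rmse)
```

Estimating the decay as well makes the problem nonlinear; a common workaround is to search over a grid of decay values and run this linear fit for each one.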
Common pitfalls
"(X⊤X)−1 always exists." No — if X has multicollinearity (columns linearly dependent or near-dependent), X⊤X is singular or ill-conditioned. Cures: drop redundant features, apply regularisation (ridge), or use PCA.
"Least squares is unbiased under any noise distribution." Unbiasedness needs only E[ϵ]=0. BLUE-ness (optimality) needs the Gauss-Markov conditions, especially constant-variance uncorrelated noise. Heteroscedastic noise (e.g. GARCH returns) breaks BLUE; weighted least squares is the cure.
"OLS assumes gaussian noise." No — OLS is the MLE if noise is gaussian, but OLS itself doesn't require gaussianity. The Gauss-Markov theorem requires only the second-moment assumptions above.
"High R2 means a good model."R2=1−∥y−y^∥2/∥y−yˉ∥2 measures in-sample fit. A model can overfit and have high R2 while being useless out-of-sample. Time-series regressions especially are prone to spurious fits; always hold out validation data.
"β^j is significant iff ∣tj∣>2." Significance depends on the t-statistic distribution under the null. Under the Gauss-Markov + gaussian errors, tj=β^j/SE(β^j)∼tn−p. For n≫p, tn−p≈N(0,1) and ∣t∣>1.96 gives 5% significance. Watch out for small-sample corrections and autocorrelation in residuals (Newey-West SE's for time series).