Linear Regression Derivation

Motivation: why this matters in quant finance

Linear regression is one of the most widely used statistical tools in quantitative finance. Every time you:

Compute a stock's beta against a market index,
Estimate a factor model's loadings,
Build a statistical arbitrage signal from predictors,
Calibrate a yield-curve model to observed bond prices,
Hedge an option's residuals against linear risk factors,

— you are running a linear regression. The math is textbook but the derivation of the closed-form estimator

\hat\beta = (X^\top X)^{-1}X^\top y

is worth walking through carefully once: it shows why the estimator takes this form, when it is valid, what "best linear unbiased" means, and where the Gauss-Markov assumptions come in.

The derivation also generalises cleanly: ridge regression shrinks the estimate by replacing $X^\top X$ with $X^\top X + \lambda I$; generalised least squares replaces the identity weight matrix with $\Sigma^{-1}$; principal-component regression projects $X$ onto the top eigenvectors of $X^\top X$. Each of these is a variation on the same $(X^\top X)^{-1}X^\top y$ formula.

The informal idea

We observe $n$ data points $(x_i, y_i)$ where $x_i \in \mathbb{R}^p$ is a feature vector and $y_i \in \mathbb{R}$ is a scalar target. We postulate a linear relationship

y_i = x_i^\top \beta + \epsilon_i,

with $\beta \in \mathbb{R}^p$ an unknown vector of coefficients and $\epsilon_i$ random noise. The goal: estimate $\beta$ from the data.

The idea of least squares is to choose $\hat\beta$ to minimise the sum of squared residuals:

\hat\beta := \arg\min_\beta \sum_{i=1}^n (y_i - x_i^\top\beta)^2.

"Squared" error is chosen not because it has any special physical meaning (absolute error and Huber loss are equally reasonable) but because it yields a closed-form solution after a single round of differentiation. Minimising squared loss is also equivalent to maximum likelihood estimation when the noise is i.i.d. Gaussian, which connects least squares to MLE.

Matrix setup

Stack the data:

X = \begin{pmatrix}x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top\end{pmatrix} \in \mathbb{R}^{n \times p}, \quad y = \begin{pmatrix}y_1 \\ \vdots \\ y_n\end{pmatrix} \in \mathbb{R}^n, \quad \epsilon = \begin{pmatrix}\epsilon_1 \\ \vdots \\ \epsilon_n\end{pmatrix} \in \mathbb{R}^n.

The model is $y = X\beta + \epsilon$ . The residual vector is $r(\beta) = y - X\beta$ and the objective is

L(\beta) = \|r(\beta)\|_2^2 = (y - X\beta)^\top(y - X\beta).

Deriving $\hat\beta$

Expand the objective:

L(\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta.

(This uses the fact that $y^\top X\beta$ is a scalar, hence equal to its transpose $\beta^\top X^\top y$, so the two cross-terms combine into the single term $-2\beta^\top X^\top y$.)

Take the gradient with respect to $\beta$ :

\nabla_\beta L = -2 X^\top y + 2 X^\top X\beta.

Set to zero:

X^\top X\beta = X^\top y. \tag{normal equations}

If $X^\top X$ is invertible (equivalently, columns of $X$ are linearly independent — "full rank $p$ "), solve:

\hat\beta = (X^\top X)^{-1}X^\top y.
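
As a minimal sketch of the closed form in code, the snippet below solves the normal equations on synthetic data and compares the result with NumPy's built-in least-squares routine. The data, the seed, and the "true" coefficients are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed, for reproducibility only
n, p = 200, 4                                  # illustrative sample size and feature count
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])    # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)   # synthetic targets with small noise

# Solve the normal equations X'X b = X'y.
# np.linalg.solve avoids forming the explicit inverse of X'X.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient -2 X'y + 2 X'X b should vanish at the solution.
grad_at_solution = -2 * X.T @ y + 2 * X.T @ X @ beta_normal

print(beta_normal)
print(np.max(np.abs(beta_normal - beta_lstsq)))
print(np.max(np.abs(grad_at_solution)))
```

Solving the linear system rather than inverting $X^\top X$ is a common numerical choice; for badly conditioned designs a QR- or SVD-based solver such as `np.linalg.lstsq` is usually the safer option.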

Second-order check

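The Hessian is $\nabla^2_\beta L = 2 X^\top X$, which is positive semi-definite (PSD), so $L$ is convex and every solution of the normal equations is a global minimum. If $X$ has full column rank, $X^\top X$ is positive definite, so $\hat\beta$ is the unique global minimum. Otherwise the minimiser is not unique, and an extra criterion (the minimum-norm solution via the pseudoinverse, or regularisation) is needed to single one out.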
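
The gradient and Hessian formulas can be sanity-checked numerically. The sketch below compares the analytic gradient with central finite differences at an arbitrary point and confirms that the Hessian of a full-rank random design is positive definite; the data, sizes, and step size `h` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n, p = 50, 3                     # illustrative sizes
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)        # arbitrary point at which to check the gradient

def loss(b):
    """Sum of squared residuals L(b) = ||y - X b||^2."""
    r = y - X @ b
    return r @ r

# Analytic gradient from the derivation: -2 X'y + 2 X'X b
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate direction at a time
h = 1e-6
grad_numeric = np.array([
    (loss(beta + h * e) - loss(beta - h * e)) / (2 * h)
    for e in np.eye(p)
])
max_abs_diff = np.max(np.abs(grad_analytic - grad_numeric))

# Hessian 2 X'X: all eigenvalues positive when X has full column rank
hessian = 2 * X.T @ X
min_eigenvalue = np.linalg.eigvalsh(hessian).min()

print(max_abs_diff, min_eigenvalue)
```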

Geometric interpretation

The fitted values are $\hat y = X\hat\beta = X(X^\top X)^{-1}X^\top y = H y$ , where

H := X(X^\top X)^{-1}X^\top

is the hat matrix (the orthogonal projection onto the column space of $X$). Properties:

$H^2 = H$ (idempotent — $H$ is a projection).
$H^\top = H$ (symmetric).
$\text{rank}(H) = p$ (one per coefficient).
$(I - H)X = 0$ ($H$ leaves the columns of $X$ unchanged; by symmetry $X^\top(I - H) = 0$, so residuals are orthogonal to features).
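
The properties listed above are easy to confirm numerically. This is a small sketch on a random full-rank design (the data and sizes are assumptions for illustration); it also checks that the trace of $H$ equals $p$, a standard consequence of idempotence.

```python
import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed
n, p = 30, 3                     # illustrative sizes
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X', built with a linear solve instead of an inverse
H = X @ np.linalg.solve(X.T @ X, X.T)
I = np.eye(n)
residual = (I - H) @ y

checks = {
    "idempotent": np.allclose(H @ H, H),
    "symmetric": np.allclose(H, H.T),
    "rank_equals_p": np.linalg.matrix_rank(H, tol=1e-8) == p,
    "trace_equals_p": np.isclose(np.trace(H), p),
    "annihilates_X": np.allclose((I - H) @ X, 0),
    "residual_orthogonal_to_X": np.allclose(X.T @ residual, 0),
}
for name, ok in checks.items():
    print(name, ok)
```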

Geometric picture. As $\beta$ varies, $X\beta$ traces out the column space of $X$, a $p$-dimensional subspace of $\mathbb{R}^n$. The least-squares problem is: find the closest point in this subspace to $y$. The answer is the orthogonal projection $\hat y = Hy$. The residual $y - \hat y = (I - H)y$ is orthogonal to the column space, which is exactly the content of the normal equations

X^\top(y - X\hat\beta) = 0.

Statistical properties under Gauss-Markov assumptions

Assume:

Linearity: $y = X\beta + \epsilon$ .
$\mathbb{E}[\epsilon] = 0$ .
$\text{Cov}(\epsilon) = \sigma^2 I_n$ (homoscedasticity — constant noise variance, uncorrelated across observations).
$X$ is non-random (deterministic features) and has full column rank.

Then:

Unbiasedness

\mathbb{E}[\hat\beta] = (X^\top X)^{-1}X^\top \mathbb{E}[y] = (X^\top X)^{-1}X^\top X\beta = \beta.

Covariance

\text{Cov}(\hat\beta) = (X^\top X)^{-1}X^\top\text{Cov}(y)X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}.

So the standard error of each coefficient is $\text{SE}(\hat\beta_j) = \sigma\sqrt{[(X^\top X)^{-1}]_{jj}}$ .
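
A Monte Carlo sketch can illustrate both results: holding $X$ fixed and redrawing the noise many times, the average of $\hat\beta$ should sit close to $\beta$ and its empirical covariance close to $\sigma^2(X^\top X)^{-1}$. The design, the coefficients, $\sigma$, and the number of replications are all assumptions chosen for illustration. Since $\sigma$ is unknown in practice, the last lines compute the usual plug-in estimate $\hat\sigma^2 = \|y - X\hat\beta\|^2/(n-p)$ from one sample.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary seed
n, p, sigma = 100, 3, 0.5                 # illustrative sizes and noise level
X = rng.normal(size=(n, p))               # design held fixed across replications
beta_true = np.array([1.0, -1.0, 2.0])    # hypothetical coefficients
n_sims = 5000

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                         # (p, n) matrix mapping y to beta_hat

# Each row of eps is one independent noise draw
eps = sigma * rng.normal(size=(n_sims, n))
Y = X @ beta_true + eps                   # broadcasting adds X beta to every row
beta_hats = Y @ A.T                       # (n_sims, p): one estimate per replication

mean_beta_hat = beta_hats.mean(axis=0)
emp_cov = np.cov(beta_hats, rowvar=False)
theory_cov = sigma**2 * XtX_inv
theory_se = np.sqrt(np.diag(theory_cov))

# Plug-in estimate of sigma from a single replication
resid = Y[0] - X @ beta_hats[0]
sigma_hat = np.sqrt(resid @ resid / (n - p))

print(mean_beta_hat)
print(np.sqrt(np.diag(emp_cov)), theory_se)
print(sigma_hat)
```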

BLUE — Gauss-Markov theorem

Under assumptions 1–4, $\hat\beta$ is the Best Linear Unbiased Estimator: it has the smallest variance (in the PSD ordering) among all linear unbiased estimators of $\beta$. "Best" here means

\text{Cov}(\hat\beta) \preceq \text{Cov}(\tilde\beta)

for any other linear unbiased estimator $\tilde\beta$. The proof is a short algebraic argument; the intuition is that $\hat\beta$ uses the data as efficiently as possible given the assumptions.
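
The algebra behind the theorem can be sketched in a few lines, under the assumptions listed above. The matrices $C$ and $D$ are notation introduced here for illustration. Write any linear estimator as $\tilde\beta = Cy$ and decompose

C = (X^\top X)^{-1}X^\top + D.

Unbiasedness for every $\beta$ requires $\mathbb{E}[\tilde\beta] = CX\beta = \beta$, that is $CX = I_p$, which forces $DX = 0$. Then

\text{Cov}(\tilde\beta) = \sigma^2 CC^\top = \sigma^2\left[(X^\top X)^{-1} + DD^\top\right] = \text{Cov}(\hat\beta) + \sigma^2 DD^\top,

where the cross terms vanish because $(X^\top X)^{-1}X^\top D^\top = (X^\top X)^{-1}(DX)^\top = 0$. Since $DD^\top$ is PSD, $\text{Cov}(\hat\beta) \preceq \text{Cov}(\tilde\beta)$, with equality only when $D = 0$.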

Canonical applications

Application 1 — Stock beta

Regress stock returns $r_t$ on market returns $r_{m,t}$ :

r_t = \alpha + \beta r_{m,t} + \epsilon_t.

With $X = \begin{pmatrix}\mathbf{1} & r_m\end{pmatrix}$, the slope estimate reduces to the ratio of the sample covariance to the sample variance:

\hat\beta = \frac{\widehat{\text{Cov}}(r_t, r_{m,t})}{\widehat{\text{Var}}(r_{m,t})}.

The stock's beta is a covariance ratio, the classic CAPM formula, derived directly from least squares. This is the basis of the Single Index Model and the foundation of factor risk models.
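
The equivalence between the regression slope and the covariance ratio can be checked directly. The sketch below uses synthetic return series (the return scales, `alpha_true`, and `beta_true` are hypothetical values, not market data) and computes beta both ways.

```python
import numpy as np

rng = np.random.default_rng(4)                  # arbitrary seed
T = 500                                         # illustrative number of periods
r_m = 0.01 * rng.normal(size=T)                 # synthetic market returns
alpha_true, beta_true = 0.0002, 1.3             # hypothetical parameters
r = alpha_true + beta_true * r_m + 0.005 * rng.normal(size=T)

# Regression with an intercept: X = [1, r_m]
X = np.column_stack([np.ones(T), r_m])
alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ r)

# Covariance ratio: sample Cov(r, r_m) / sample Var(r_m)
cov = np.cov(r, r_m)                            # 2x2 sample covariance matrix
beta_cov_ratio = cov[0, 1] / cov[1, 1]

print(beta_hat, beta_cov_ratio)
```

The two numbers agree up to floating-point error because the normalisation of the sample covariance and variance cancels in the ratio.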

Application 2 — Fama-French three-factor model

Regress excess stock returns on three factors: market excess return, SMB (size), HML (value):

r_{i,t} - r_f = \alpha_i + \beta_{i,M}(r_{m,t} - r_f) + \beta_{i,\text{SMB}}\text{SMB}_t + \beta_{i,\text{HML}}\text{HML}_t + \epsilon_{i,t}.

Same formula $\hat\beta = (X^\top X)^{-1}X^\top y$ . The loadings $\hat\beta$ are the stock's factor exposures.

Application 3 — Yield curve fitting

Fit a Nelson-Siegel or Svensson parametric model to observed zero-coupon yields. Once the decay parameters are fixed, the model is linear in the remaining coefficients (level, slope and curvature in Nelson-Siegel), so least squares gives the fit in closed form.
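
A minimal sketch of the Nelson-Siegel case, assuming the common parametrisation $y(\tau) = \beta_0 + \beta_1\frac{1 - e^{-\tau/\lambda}}{\tau/\lambda} + \beta_2\left(\frac{1 - e^{-\tau/\lambda}}{\tau/\lambda} - e^{-\tau/\lambda}\right)$ with the decay $\lambda$ held fixed. The maturities, the decay value, the coefficients, and the yields are synthetic assumptions; the helper name `nelson_siegel_design` is hypothetical.

```python
import numpy as np

def nelson_siegel_design(maturities, decay):
    """Design matrix with columns [level, slope, curvature] for a fixed decay."""
    x = np.asarray(maturities, dtype=float) / decay
    slope = (1 - np.exp(-x)) / x
    curvature = slope - np.exp(-x)
    return np.column_stack([np.ones_like(x), slope, curvature])

maturities = np.array([0.25, 0.5, 1, 2, 3, 5, 7, 10, 20, 30])   # years, illustrative
decay = 2.0                                                     # assumed fixed decay parameter
coef_true = np.array([0.04, -0.02, 0.01])                       # hypothetical coefficients

rng = np.random.default_rng(5)                                  # arbitrary seed
X = nelson_siegel_design(maturities, decay)
yields = X @ coef_true + 1e-4 * rng.normal(size=len(maturities))  # synthetic yields

# With the decay fixed, the fit is an ordinary least-squares problem
coef_hat, *_ = np.linalg.lstsq(X, yields, rcond=None)
fitted = X @ coef_hat

print(coef_hat)
print(np.max(np.abs(fitted - yields)))
```

In a full calibration the decay parameter would also be chosen, for example by a grid search that repeats this linear fit for each candidate value.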

Common pitfalls

"$(X^\top X)^{-1}$ always exists." No: if $X$ has multicollinearity (columns linearly dependent or near-dependent), $X^\top X$ is singular or ill-conditioned. Cures: drop redundant features, apply regularisation (ridge), or use PCA.
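
To make the conditioning problem concrete, the sketch below builds two nearly identical columns, inspects the condition number of $X^\top X$, and shows how a ridge penalty repairs it. The data and the penalty `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)             # arbitrary seed
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)        # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)     # synthetic target

# Condition number of X'X: very large when columns are nearly dependent
cond_ols = np.linalg.cond(X.T @ X)

# Ridge: replace X'X with X'X + lam * I
lam = 1.0                                  # hypothetical penalty strength
XtX_ridge = X.T @ X + lam * np.eye(2)
cond_ridge = np.linalg.cond(XtX_ridge)
beta_ridge = np.linalg.solve(XtX_ridge, X.T @ y)

print(cond_ols, cond_ridge)
print(beta_ridge)
```

Ridge cannot tell the two columns apart, so it splits the weight roughly evenly between them, at the cost of a small shrinkage bias.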

"Least squares is unbiased under any noise distribution." Unbiasedness needs only

\mathbb{E}[\epsilon] = 0

. BLUE-ness (optimality) needs the Gauss-Markov conditions, especially constant-variance uncorrelated noise. Heteroscedastic noise (e.g. GARCH returns) breaks BLUE; weighted least squares is the cure.
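
A sketch of the weighted least squares remedy, assuming the per-observation noise standard deviations are known (in practice they would have to be estimated). With a diagonal noise covariance $\Sigma$, the exact covariances are $(X^\top X)^{-1}X^\top\Sigma X(X^\top X)^{-1}$ for OLS and $(X^\top\Sigma^{-1}X)^{-1}$ for WLS, so the two can be compared without simulation. All data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(7)                       # arbitrary seed
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 2.0])                     # hypothetical coefficients
noise_sd = rng.uniform(0.1, 2.0, size=n)             # assumed known, varies by observation
y = X @ beta_true + noise_sd * rng.normal(size=n)

# Ordinary least squares ignores the unequal variances
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Weighted least squares with weights 1 / variance
w = 1.0 / noise_sd**2
Xw = X * w[:, None]                                  # Sigma^{-1} X for diagonal Sigma
beta_wls = np.linalg.solve(X.T @ Xw, Xw.T @ y)

# Exact covariances of both estimators under this noise model
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = XtX_inv @ (X.T @ (X * noise_sd[:, None]**2)) @ XtX_inv
cov_wls = np.linalg.inv(X.T @ Xw)

print(beta_ols, beta_wls)
print(np.sqrt(np.diag(cov_ols)), np.sqrt(np.diag(cov_wls)))
```

Both estimators remain unbiased here; the difference shows up in the standard errors, which are smaller for WLS.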

"OLS assumes gaussian noise." No — OLS is the MLE if noise is gaussian, but OLS itself doesn't require gaussianity. The Gauss-Markov theorem requires only the second-moment assumptions above.

"High $R^2$ means a good model."

R^2 = 1 - \|y - \hat y\|^2/\|y - \bar y\|^2

measures in-sample fit. A model can overfit and have high

R^2

while being useless out-of-sample. Time-series regressions especially are prone to spurious fits; always hold out validation data.
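
The gap between in-sample and out-of-sample fit is easy to reproduce. In the sketch below the target is pure noise, unrelated to any of the 40 predictors, yet regressing on them with only 60 observations still produces a sizeable in-sample $R^2$. The sizes and data are illustrative assumptions, and `r_squared` is a hypothetical helper implementing the formula above.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS, with TSS measured around the mean of y."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

rng = np.random.default_rng(8)            # arbitrary seed
n_train, n_test, p = 60, 60, 40           # many predictors relative to observations
X_train = np.column_stack([np.ones(n_train), rng.normal(size=(n_train, p))])
X_test = np.column_stack([np.ones(n_test), rng.normal(size=(n_test, p))])
y_train = rng.normal(size=n_train)        # pure noise: no true relationship
y_test = rng.normal(size=n_test)

# Fit on the training half only
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

r2_in = r_squared(y_train, X_train @ beta_hat)
r2_out = r_squared(y_test, X_test @ beta_hat)

print(r2_in, r2_out)
```

For time series the hold-out set should come from a later period than the training set, rather than from a random split.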

" $\hat\beta_j$ is significant iff $|t_j| > 2$ ." Significance depends on the t-statistic distribution under the null. Under the Gauss-Markov + gaussian errors,

t_j = \hat\beta_j / \text{SE}(\hat\beta_j) \sim t_{n-p}

. For

n \gg p

t_{n-p} \approx \mathcal{N}(0, 1)

and

|t| > 1.96

gives 5% significance. Watch out for small-sample corrections and autocorrelation in residuals (Newey-West SE's for time series).
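
A short sketch of the t-statistic computation, using the classical (homoscedastic, uncorrelated) standard errors and the large-sample threshold of 1.96. The data are synthetic and the coefficient vector is a hypothetical choice in which only the second coefficient is non-zero; no autocorrelation correction is applied here.

```python
import numpy as np

rng = np.random.default_rng(9)                   # arbitrary seed
n, p = 500, 3                                    # n >> p, so the normal approximation is reasonable
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.0, 1.0, 0.0])            # hypothetical: only one real effect
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Estimate the noise variance with n - p degrees of freedom
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)

# Classical standard errors and t-statistics
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_stats = beta_hat / se
significant = np.abs(t_stats) > 1.96             # two-sided 5% level, large-n approximation

print(t_stats)
print(significant)
```

Coefficients whose true value is zero will still cross the threshold about 5% of the time, which is what the significance level means.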

Where this goes next

Covariance Matrices: For centred features, the sample covariance $X^\top X/n$ is, up to scaling, the matrix being inverted.
Maximum Likelihood Estimation: Under Gaussian errors, OLS = MLE.
Linear Regression (programming lesson): Implementation details, gradient descent, regularisation.
Correlation and Dependence: Stock beta is a covariance-normalised regression coefficient.
Capital Asset Pricing Model (CAPM): Beta from regression is the CAPM's central parameter (future lesson).