Solution: Regression as Projection — Computing the Hat Matrix

Exercise: Regression as Projection: Computing the Hat Matrix

Part 1 — Symmetry

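Assume $X$ has full column rank, so that $X^\top X$ is invertible. Because $X^\top X$ is symmetric, so is its inverse: $((X^\top X)^{-1})^\top = ((X^\top X)^\top)^{-1} = (X^\top X)^{-1}$. Therefore: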

$$H^\top = (X(X^\top X)^{-1}X^\top)^\top = X((X^\top X)^{-1})^\top X^\top = X(X^\top X)^{-1}X^\top = H. \quad \checkmark$$

Part 2 — Idempotence

$$H^2 = X(X^\top X)^{-1}X^\top\cdot X(X^\top X)^{-1}X^\top = X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top = X(X^\top X)^{-1}X^\top = H. \quad \checkmark$$

Part 3 — Trace

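By the cyclic property of the trace, $\text{tr}(AB) = \text{tr}(BA)$, applied with $A = X$ and $B = (X^\top X)^{-1}X^\top$:

$$\text{tr}(H) = \text{tr}(X(X^\top X)^{-1}X^\top) = \text{tr}((X^\top X)^{-1}X^\top X) = \text{tr}(I_p) = p. \quad \checkmark$$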
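
Idempotence and the trace are linked: because $H^2 = H$ and $H$ is symmetric, every eigenvalue of $H$ is 0 or 1, so $\text{tr}(H)$ counts the unit eigenvalues, which is $\text{rank}(X) = p$. A minimal numerical sketch of this, assuming NumPy and a randomly generated full-rank design matrix (the names `hat_matrix` and `X_demo` and the seed are illustrative, not part of the exercise):

```python
import numpy as np

def hat_matrix(X):
    """Return H = X (X^T X)^{-1} X^T, assuming X has full column rank."""
    # Solve (X^T X) Z = X^T instead of forming the inverse explicitly.
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(0)       # arbitrary seed, illustrative data only
n, p = 6, 3
X_demo = rng.normal(size=(n, p))     # hypothetical design matrix

H_demo = hat_matrix(X_demo)
eigvals = np.linalg.eigvalsh(H_demo)  # ascending eigenvalues, symmetric input

# Expect n - p eigenvalues near 0 followed by p eigenvalues near 1.
expected = np.concatenate([np.zeros(n - p), np.ones(p)])
print("eigenvalues are 0 or 1:", np.allclose(eigvals, expected))
print("trace equals p:", np.isclose(np.trace(H_demo), p))
```

The sketch uses `np.linalg.solve` rather than an explicit inverse, which is usually the better-conditioned way to evaluate the same formula.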

Part 4 — Residuals orthogonal to columns

With $e = y - X\hat\beta$ and $\hat\beta = (X^\top X)^{-1}X^\top y$:

$X^\top e = X^\top(y - X\hat\beta) = X^\top y - X^\top X(X^\top X)^{-1}X^\top y = X^\top y - X^\top y = 0$. ✓
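
The same result can be read off from $H$ directly, which ties this part to the projection properties above. A short derivation, writing $\hat y = X\hat\beta = Hy$ for the fitted values (notation assumed here, matching `y_hat` in the numerical example):

$$e = y - \hat y = (I - H)y, \qquad X^\top H = X^\top X(X^\top X)^{-1}X^\top = X^\top \;\Longrightarrow\; X^\top e = (X^\top - X^\top H)y = 0.$$

Since $\hat y = X\hat\beta$ lies in the column space of $X$, it follows that $\hat y^\top e = \hat\beta^\top X^\top e = 0$, and therefore $\|y\|^2 = \|\hat y\|^2 + \|e\|^2$.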

Part 5 — Numerical example

import numpy as np

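# Design matrix: intercept column plus one feature (n = 4, p = 2)
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3]])
y = np.array([1, 2, 2, 3])

# OLS estimate via the normal equations: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y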
print("beta_hat:", beta_hat)
# beta_hat: [1.1 0.6]

H = X @ XtX_inv @ X.T
print("H =\n", H.round(3))
# H =
#  [[0.7  0.4  0.1 -0.2]
#   [0.4  0.3  0.2  0.1]
#   [0.1  0.2  0.3  0.4]
#   [-0.2 0.1  0.4  0.7]]

print("trace(H):", np.trace(H))
# trace(H): 2.0  (= p ✓)

y_hat = H @ y
e = y - y_hat
print("y_hat:", y_hat)
print("e:", e)
# y_hat: [1.1 1.7 2.3 2.9]
# e: [-0.1 0.3 -0.3 0.1]

print("X^T e:", X.T @ e)
# X^T e: [ 1.11e-16  -2.22e-16]  (numerically zero ✓)

print("H^2 == H:", np.allclose(H @ H, H))
# H^2 == H: True
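
Forming $(X^\top X)^{-1}$ explicitly is fine for a 4×2 example, but it can lose accuracy when $X$ is ill-conditioned. A common alternative, sketched here as an assumption rather than as part of the exercise, builds the same projection from a reduced QR factorisation $X = QR$, for which $H = QQ^\top$. The sketch also checks the Pythagorean split $\|y\|^2 = \|\hat y\|^2 + \|e\|^2$ that follows from the residuals being orthogonal to the fitted values (the names `H_qr` and `H_ne` are illustrative):

```python
import numpy as np

# Same data as the numerical example.
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
y = np.array([1, 2, 2, 3], dtype=float)

# Reduced QR: Q is n x p with orthonormal columns spanning col(X).
Q, R = np.linalg.qr(X)   # default mode is "reduced"
H_qr = Q @ Q.T           # orthogonal projection onto col(X)

# Reference: the normal-equations formula used in the example.
H_ne = X @ np.linalg.inv(X.T @ X) @ X.T
print("QR and normal-equations H agree:", np.allclose(H_qr, H_ne))

# Pythagorean decomposition implied by y_hat being orthogonal to e.
y_hat = H_qr @ y
e = y - y_hat
print("||y||^2 == ||y_hat||^2 + ||e||^2:",
      np.isclose(y @ y, y_hat @ y_hat + e @ e))
```

For this data the split is $18 = 17.8 + 0.2$, up to floating-point rounding.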

Takeaways

The hat matrix is an orthogonal projection: it is symmetric and idempotent. Geometrically, $H$ projects $\mathbb{R}^n$ onto the column space of $X$.
$\text{tr}(H) = p$ is the "effective degrees of freedom" of the regression. With regularisation such as ridge ($\lambda > 0$), this becomes $\text{tr}(H_\lambda) < p$, giving a meaningful measure of model complexity.
Residuals are orthogonal to the features. $X^\top e = 0$ is the first-order condition for least squares; it says that no linear combination of the features can further reduce the squared norm of the residual.
Leverage points have large $H_{ii}$. The diagonal entries satisfy $H_{ii} \in [0, 1]$ with $\sum_i H_{ii} = p$; large values flag observations with unusual feature vectors, which can have outsized influence on $\hat\beta$. Leverage is a standard regression diagnostic.
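
The degrees-of-freedom and leverage takeaways can be checked on the example data. The sketch below is a rough illustration under stated assumptions: it takes the regularised hat matrix to be the ridge form $H_\lambda = X(X^\top X + \lambda I)^{-1}X^\top$, penalises every coefficient including the intercept (a simplification), and uses the hypothetical helper name `ridge_hat_matrix`.

```python
import numpy as np

# Same design matrix as the numerical example (n = 4, p = 2).
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
n, p = X.shape

def ridge_hat_matrix(X, lam):
    """H_lambda = X (X^T X + lam * I)^{-1} X^T for a ridge penalty lam >= 0.

    Assumption: all coefficients, including the intercept, are penalised.
    With lam = 0 this reduces to the ordinary hat matrix.
    """
    k = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(k), X.T)

# Effective degrees of freedom shrink from p as the penalty grows.
for lam in [0.0, 0.1, 1.0, 10.0]:
    df = np.trace(ridge_hat_matrix(X, lam))
    print(f"lambda = {lam:5.1f}   tr(H_lambda) = {df:.4f}")

# Leverage: diagonal of the unpenalised hat matrix.
leverage = np.diag(ridge_hat_matrix(X, 0.0))
print("leverage:", leverage.round(3))
print("sum of leverages:", leverage.sum().round(6))
```

For this data the leverages are 0.7, 0.3, 0.3, 0.7, matching the diagonal of $H$ in Part 5: the two observations at the ends of the feature range ($x = 0$ and $x = 3$) carry the most leverage.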