Solution: Deriving the Normal Equations by Calculus
Part 1
$L(\alpha, \beta) = \sum (y_i - \alpha - \beta x_i)^2$.
$\partial L/\partial\alpha = -2\sum(y_i - \alpha - \beta x_i) = 0$, giving
$$\sum y_i = n\alpha + \beta\sum x_i. \tag{A}$$
$\partial L/\partial\beta = -2\sum x_i(y_i - \alpha - \beta x_i) = 0$, giving
$$\sum x_i y_i = \alpha\sum x_i + \beta\sum x_i^2. \tag{B}$$
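As a quick illustration, equations (A) and (B) form a 2x2 linear system in $(\alpha, \beta)$ that can be solved directly. The sketch below uses a small made-up data set (the arrays `x` and `y` are assumptions for illustration, not part of the exercise) and confirms that both partial derivatives vanish at the solution.

```python
import numpy as np

# Illustrative data (assumed, not taken from the exercise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
n = len(x)

# Normal equations (A) and (B) written as M @ [alpha, beta] = v.
M = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
v = np.array([y.sum(), (x * y).sum()])
alpha, beta = np.linalg.solve(M, v)

# First-order conditions: both partial derivatives of L should be ~0 here.
resid = y - alpha - beta * x
grad_alpha = -2 * resid.sum()
grad_beta = -2 * (x * resid).sum()
print(f"alpha={alpha:.4f}, beta={beta:.4f}")
print(f"dL/dalpha={grad_alpha:.2e}, dL/dbeta={grad_beta:.2e}")
```

Solving the system with `np.linalg.solve` is one convenient choice here; Part 2 obtains the same solution by hand.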
Part 2
From (A): $\hat\alpha = \bar y - \hat\beta \bar x$.
Substitute into (B):
$$\sum x_i y_i = (\bar y - \hat\beta\bar x)\sum x_i + \hat\beta\sum x_i^2 = \bar y\sum x_i - \hat\beta\bar x\sum x_i + \hat\beta\sum x_i^2.$$
Using $\sum x_i = n\bar x$:
$$\sum x_i y_i = n\bar x\bar y - n\hat\beta\bar x^2 + \hat\beta\sum x_i^2.$$
Rearrange:
$$\hat\beta\left(\sum x_i^2 - n\bar x^2\right) = \sum x_i y_i - n\bar x\bar y.$$
$$\hat\beta = \frac{\sum x_i y_i - n\bar x\bar y}{\sum x_i^2 - n\bar x^2} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}. \quad \checkmark$$
(The last equality uses the algebraic identities $\sum x_i y_i - n\bar x\bar y = \sum(x_i - \bar x)(y_i - \bar y)$ and $\sum x_i^2 - n\bar x^2 = \sum(x_i - \bar x)^2$.)
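For completeness, the two identities can be checked by expanding the product and using $\sum x_i = n\bar x$ and $\sum y_i = n\bar y$:

$$\sum(x_i - \bar x)(y_i - \bar y) = \sum x_i y_i - \bar y\sum x_i - \bar x\sum y_i + n\bar x\bar y = \sum x_i y_i - n\bar x\bar y - n\bar x\bar y + n\bar x\bar y = \sum x_i y_i - n\bar x\bar y.$$

Setting $y_i = x_i$ in the same expansion gives $\sum(x_i - \bar x)^2 = \sum x_i^2 - n\bar x^2$.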
Part 3
Sample covariance $\widehat{\text{Cov}}(x, y) = \tfrac{1}{n-1}\sum(x_i - \bar x)(y_i - \bar y)$ and sample variance $\widehat{\text{Var}}(x) = \tfrac{1}{n-1}\sum(x_i - \bar x)^2$. Their ratio:
$$\frac{\widehat{\text{Cov}}(x, y)}{\widehat{\text{Var}}(x)} = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sum(x_i - \bar x)^2} = \hat\beta. \quad \checkmark$$
The Bessel-correction factor $(n - 1)$ cancels.
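The cancellation can also be seen numerically. The following sketch, on simulated data (seed and sample size are arbitrary choices for illustration), computes the ratio with `ddof=1` (divide by $n - 1$), with `ddof=0` (divide by $n$), and with raw sums; all three should agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for illustration only
x = rng.standard_normal(50)
y = 2 + 3 * x + rng.standard_normal(50)

def slope(ddof):
    # np.cov returns the 2x2 covariance matrix; entry [0, 1] is Cov(x, y).
    cov_xy = np.cov(x, y, ddof=ddof)[0, 1]
    var_x = np.var(x, ddof=ddof)
    return cov_xy / var_x

beta_bessel = slope(ddof=1)  # normalise by n - 1
beta_plain = slope(ddof=0)   # normalise by n
beta_sums = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
print(beta_bessel, beta_plain, beta_sums)
```

The one thing to watch is that numerator and denominator use the same normalisation; mixing `ddof=1` in one and `ddof=0` in the other would rescale the slope by $n/(n-1)$ or its inverse.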
Part 4 — Numerical sanity check
```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
y = 2 + 3 * x + eps

# closed-form
x_bar, y_bar = x.mean(), y.mean()
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
alpha_hat = y_bar - beta_hat * x_bar
print(f"alpha={alpha_hat:.3f}, beta={beta_hat:.3f}")
# alpha=2.020, beta=3.094
```
Close to the true $(\alpha, \beta) = (2, 3)$; deviations are consistent with $\text{SE}(\hat\beta) = \sigma/\sqrt{\sum(x_i - \bar x)^2} \approx 1/\sqrt{100} = 0.1$, since $\sigma = 1$ here and $\sum(x_i - \bar x)^2 \approx n = 100$ for standard-normal $x$.
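To go one step further, the closed-form estimates can be cross-checked against a library fit, and the standard error can be estimated from the sample itself. The sketch below regenerates the same data and compares against `np.polyfit`; replacing the unknown $\sigma$ by the residual standard deviation with $n - 2$ degrees of freedom is a standard choice, but it is an addition to the original check.

```python
import numpy as np

# Same seed and draw order as the sanity check, so the data are identical.
rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
y = 2 + 3 * x + eps

sxx = np.sum((x - x.mean())**2)
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
alpha_hat = y.mean() - beta_hat * x.mean()

# Cross-check: np.polyfit with deg=1 returns [slope, intercept].
slope_pf, intercept_pf = np.polyfit(x, y, 1)

# Estimated SE: sigma replaced by the residual standard deviation
# with n - 2 degrees of freedom (an assumption of this sketch).
resid = y - alpha_hat - beta_hat * x
s = np.sqrt(np.sum(resid**2) / (n - 2))
se_beta = s / np.sqrt(sxx)
print(f"closed form: beta={beta_hat:.3f}, polyfit: beta={slope_pf:.3f}")
print(f"estimated SE(beta)={se_beta:.3f}")
```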
Takeaways
Normal equations emerge from setting partial derivatives of the squared-loss to zero. No calculus tricks; direct application of first-order conditions.
Closed form in simple (1-d) regression: $\hat\beta$ is a covariance-variance ratio. This is the "rise over run" intuition for slope, made rigorous.
Sample covariance and variance forms are the numerator and denominator. Bessel's correction $(n - 1)$ cancels in the ratio, so using "sum" or "sum divided by $n - 1$" gives the same $\hat\beta$.
Standard error decreases as $1/\sqrt{\sum(x_i - \bar x)^2}$. More data and more variation in the predictor both improve precision.
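The last point can be illustrated with a small Monte Carlo sketch. Holding a design $x$ fixed and redrawing the noise many times, the empirical standard deviation of $\hat\beta$ should track $\sigma/\sqrt{\sum(x_i - \bar x)^2}$. The seed, the number of replications, and the three designs below are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
sigma = 1.0
n_reps = 2000

def slope_sd(x):
    """Empirical and theoretical sd of beta_hat with x held fixed."""
    xc = x - x.mean()
    sxx = np.sum(xc**2)
    eps = rng.normal(0.0, sigma, size=(n_reps, len(x)))
    y = 2 + 3 * x + eps              # one simulated sample per row
    # sum(xc * (y - y_bar)) equals sum(xc * y) because xc sums to zero.
    betas = (y @ xc) / sxx
    return betas.std(ddof=1), sigma / np.sqrt(sxx)

x_base = rng.standard_normal(100)
designs = {
    "n=100": x_base,
    "n=400": rng.standard_normal(400),
    "n=100, x scaled by 3": 3 * x_base,
}
results = {label: slope_sd(x) for label, x in designs.items()}
for label, (emp, theory) in results.items():
    print(f"{label:22s} empirical sd={emp:.4f}  formula={theory:.4f}")
```

Quadrupling $n$ roughly halves the standard error, while tripling the spread of $x$ cuts it by exactly a factor of three for the same noise level.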