
Maximum Likelihood Estimation

Motivation: why this matters in quant finance

Maximum Likelihood Estimation (MLE) is the default method for fitting probabilistic models to data in quant finance. Given a parametric model with unknown parameters $\theta$, MLE chooses $\hat\theta$ to maximise the probability of the observed data:
$$\hat\theta_{\text{MLE}} := \arg\max_\theta p(\text{data} \mid \theta).$$

In quant-finance practice, MLE underlies:

  • GARCH volatility estimation. ARCH/GARCH-family parameters are estimated by maximising the log-likelihood of daily returns.
  • Option-model calibration. Heston, SABR, and Bates parameters are fitted to implied-volatility surfaces, typically by least squares, which is MLE under gaussian pricing errors.
  • Credit-spread term structure. Merton, CIR, and Hull-White parameters are calibrated by MLE to observed yield data.
  • Regime-switching models. Hidden Markov model parameters come from EM, an iterative algorithm that computes the MLE when the regimes are unobserved.
  • Kalman filter estimation. State-space parameters are estimated by maximising the innovation (prediction-error) likelihood.
  • Factor-model identification. Principal-factor, CAPM, and Fama-French parameters are MLE estimates under gaussian errors (equivalent to OLS).

MLE has three attractive properties: consistency ($\hat\theta_{\text{MLE}} \to \theta_0$), asymptotic normality ($\sqrt n(\hat\theta - \theta_0) \Rightarrow \mathcal{N}(0, I(\theta_0)^{-1})$), and asymptotic efficiency (it attains the Cramér-Rao lower bound asymptotically). Under standard regularity conditions, no regular estimator has a smaller asymptotic variance than the MLE.

The informal idea

Given i.i.d. data $x_1, \ldots, x_n$ from a distribution with density $p(x; \theta)$, the likelihood function is:
$$L(\theta) := \prod_{i=1}^n p(x_i; \theta).$$
The likelihood is the probability (or density) of observing the data as a function of the parameter. MLE picks the $\theta$ that maximises it: "the value of $\theta$ most consistent with the data."
In practice we work with the log-likelihood $\ell(\theta) := \log L(\theta) = \sum \log p(x_i; \theta)$, since products of densities become sums of log-densities and differentiation is much cleaner.
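
There is also a numerical reason to prefer the log-likelihood. The sketch below multiplies the densities directly and then sums their logs instead. The standard-normal data, the sample size of 2000, and the helper names are illustrative assumptions, not part of the lesson.

```python
import math
import random

random.seed(0)
# Illustrative data: 2000 draws from a standard normal (an assumption for this sketch).
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu, sigma):
    """Raw product of densities: L(theta)."""
    out = 1.0
    for x in data:
        out *= normal_pdf(x, mu, sigma)
    return out

def log_likelihood(data, mu, sigma):
    """Sum of log-densities: l(theta) = log L(theta)."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in data)

raw = likelihood(data, 0.0, 1.0)      # every factor is below 0.4, so the product underflows
ll = log_likelihood(data, 0.0, 1.0)   # the sum of logs stays finite

print(raw, ll)
```

Each factor is at most about 0.4, so the product of 2000 of them falls below the smallest positive double and is stored as zero. The log-likelihood remains an ordinary finite number that an optimiser can work with.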

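When $\ell$ is twice differentiable and the maximum lies in the interior of the parameter space, the MLE satisfies the first- and second-order conditions:

$$\ell'(\hat\theta) = 0, \quad \ell''(\hat\theta) < 0 \quad \text{(at a maximum)}.$$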

Formal definition

Let $x_1, \ldots, x_n$ be i.i.d. samples from a distribution with density (or pmf) $p(\cdot; \theta)$ where $\theta \in \Theta \subseteq \mathbb{R}^k$ is unknown.

  • Likelihood function: $L: \Theta \to [0, \infty)$, $L(\theta) := \prod_i p(x_i; \theta)$.
  • Log-likelihood: $\ell: \Theta \to \mathbb{R}$, $\ell(\theta) := \sum_i \log p(x_i; \theta)$.
  • Maximum likelihood estimator (MLE): $\hat\theta_{\text{MLE}} := \arg\max_{\theta\in\Theta} \ell(\theta)$.

Existence and uniqueness of the MLE depend on the model and data; in regular parametric families, the MLE exists a.s. for large $n$ and is unique under identifiability.

Standard properties under regularity

Under standard regularity conditions (identifiability, smoothness of the likelihood, interior maximum, correctly specified model, etc.):

Property 1 — Consistency

$$\hat\theta_{\text{MLE}} \xrightarrow{\mathbb{P}} \theta_0,$$

where $\theta_0$ is the true parameter. As sample size grows, the MLE converges in probability to the truth.

Property 2 — Asymptotic normality

$$\sqrt n\,(\hat\theta_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1}),$$
where $I(\theta)$ is the Fisher information matrix (per observation):
$$I(\theta) := -\mathbb{E}_\theta[\ell_1''(\theta)] = \mathbb{E}_\theta[\ell_1'(\theta)\ell_1'(\theta)^\top],$$

with $\ell_1(\theta) := \log p(X_1; \theta)$ the log-density for a single observation.

Finite-sample practitioner's rule: $\hat\theta_{\text{MLE}} \approx \mathcal{N}(\theta_0, I(\hat\theta)^{-1}/n)$, so standard errors are $\text{SE}(\hat\theta_j) \approx \sqrt{[I(\hat\theta)^{-1}]_{jj}/n}$.
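
The practitioner's rule can be checked by simulation. The sketch below uses the exponential rate as a test case, where the MLE is $1/\bar x$ and the per-observation Fisher information is $I(\lambda) = 1/\lambda^2$, so the rule predicts $\text{SE}(\hat\lambda) \approx \lambda/\sqrt n$. The true rate, sample size, and number of replications are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam0, n, reps = 2.0, 500, 4000   # illustrative values, not from the lesson

# Each row is one sample of size n from Exp(lam0); numpy parametrises by scale = 1/rate.
samples = rng.exponential(scale=1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / samples.mean(axis=1)   # closed-form MLE of the rate, one per replication

# Asymptotic theory: Var(lam_hat) ~ I(lam0)^{-1} / n = lam0^2 / n.
se_theory = lam0 / np.sqrt(n)
se_empirical = lam_hat.std(ddof=1)

print(f"theoretical SE: {se_theory:.4f}, simulated SE: {se_empirical:.4f}")
```

With these settings the two numbers should agree to within a few percent. The small remaining gap reflects the finite-sample bias and variance of $\hat\lambda$, which the asymptotic formula ignores.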

Property 3 — Asymptotic efficiency

The asymptotic variance $I(\theta_0)^{-1}/n$ matches the Cramér-Rao lower bound for unbiased estimators. No unbiased estimator can do better asymptotically.

Property 4 — Invariance

For any function $g$: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$. (Cheap but powerful: e.g. if $\hat\sigma$ is the MLE of volatility, then $\hat\sigma^2$ is the MLE of variance.)

Canonical examples

Example 1 — Mean of a normal with known variance

$X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$, $\sigma^2$ known.

$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2.$$

$\ell'(\mu) = (1/\sigma^2)\sum(x_i - \mu) = 0 \Rightarrow \hat\mu = \bar x$. Sample mean is the MLE.

Example 2 — Both mean and variance of a normal

$X_i$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both unknown. Log-likelihood:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2.$$

Setting $\partial\ell/\partial\mu = 0$: $\hat\mu = \bar x$. Setting $\partial\ell/\partial\sigma^2 = 0$:

$$-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum(x_i - \bar x)^2 = 0 \Rightarrow \hat\sigma^2 = \frac{1}{n}\sum(x_i - \bar x)^2.$$
Caveat: $\hat\sigma^2$ is biased downward, with $\mathbb{E}[\hat\sigma^2] = (n-1)\sigma^2/n$. The unbiased (Bessel-corrected) version divides by $n - 1$; the MLE divides by $n$. This is why Bessel's correction is standard in practice: in small samples, practitioners usually prefer an unbiased variance estimate to the exact maximiser of the likelihood.
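
The size of the bias is easy to see in a small simulation. The sketch below averages both estimators over many samples of size 5. The true variance, sample size, and replication count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, sigma2 = 5, 200_000, 4.0   # illustrative values

x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))

# ddof=0 divides by n (the MLE); ddof=1 divides by n - 1 (Bessel's correction).
mle_var = x.var(axis=1, ddof=0).mean()
bessel_var = x.var(axis=1, ddof=1).mean()

print(f"average MLE variance:    {mle_var:.3f}  (theory: {(n - 1) / n * sigma2:.3f})")
print(f"average Bessel variance: {bessel_var:.3f}  (theory: {sigma2:.3f})")
```

With $n = 5$ the MLE underestimates the variance by 20% on average, matching the factor $(n-1)/n$. The gap vanishes as $n$ grows.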

Example 3 — Exponential rate

$X_i$ i.i.d. $\text{Exp}(\lambda)$: $p(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$.

$$\ell(\lambda) = n\log\lambda - \lambda\sum x_i.$$

$\ell'(\lambda) = n/\lambda - \sum x_i = 0 \Rightarrow \hat\lambda = n/\sum x_i = 1/\bar x$. Reciprocal of the sample mean.
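
As a sanity check on the derivation, the sketch below evaluates the exponential log-likelihood on a fine grid and compares the grid maximiser with $1/\bar x$. The simulated data, the true rate of 1.5, and the grid bounds are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0 / 1.5, size=1000)   # simulated data, true rate 1.5

def loglik(lam):
    """Exponential log-likelihood n*log(lam) - lam*sum(x); accepts arrays of lam."""
    return x.size * np.log(lam) - lam * x.sum()

# Brute-force search over a grid of candidate rates (spacing about 1e-4).
grid = np.linspace(0.01, 5.0, 50_000)
lam_grid = grid[np.argmax(loglik(grid))]

lam_closed = 1.0 / x.mean()   # closed-form MLE from the derivation above

print(lam_grid, lam_closed)
```

The two values should agree to within the grid spacing. A grid search is used here only because it makes the maximisation transparent; it does not scale beyond one or two parameters.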

Example 4 — Linear regression is MLE under gaussian errors

With $y = X\beta + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$:

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|^2.$$

Maximising over $\beta$ with $\sigma^2$ fixed amounts to minimising $\|y - X\beta\|^2$, which is OLS. So $\hat\beta_{\text{MLE}} = \hat\beta_{\text{OLS}} = (X^\top X)^{-1}X^\top y$. Under gaussian errors, OLS is the MLE.
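
The equivalence can be verified numerically. The sketch below maximises the gaussian log-likelihood with a general-purpose optimiser and compares the result with the least-squares solution. The simulated design, the coefficient values, and the choice to optimise over $\log\sigma^2$ (to keep the variance positive) are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 regressors
beta_true = np.array([0.5, -1.0, 2.0])                        # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.7, size=n)

def neg_loglik(params):
    """Negative gaussian log-likelihood; the last entry of params is log(sigma^2)."""
    beta, log_s2 = params[:-1], params[-1]
    resid = y - X @ beta
    s2 = np.exp(log_s2)
    return 0.5 * n * np.log(2 * np.pi * s2) + 0.5 * (resid @ resid) / s2

res = minimize(neg_loglik, x0=np.zeros(4), method="BFGS")
beta_mle = res.x[:-1]
sigma2_mle = np.exp(res.x[-1])

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_closed = np.mean((y - X @ beta_ols) ** 2)   # MLE of sigma^2 divides by n

print(beta_mle, beta_ols)
print(sigma2_mle, sigma2_closed)
```

The coefficient vectors should match up to optimiser tolerance. The variance estimate matches the residual sum of squares divided by $n$, not by $n$ minus the number of regressors.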

Example 5 — GARCH(1,1) via MLE

For $r_t \mid \mathcal{F}_{t-1} \sim \mathcal{N}(0, \sigma_t^2)$ with $\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta\sigma_{t-1}^2$, the log-likelihood is, up to an additive constant that does not depend on the parameters:

$$\ell(\omega, \alpha, \beta) = -\sum_t \left[\tfrac{1}{2}\log\sigma_t^2(\theta) + \tfrac{r_t^2}{2\sigma_t^2(\theta)}\right].$$

Here $\theta = (\omega, \alpha, \beta)$, and the $\sigma_t^2(\theta)$ are computed recursively from the data and parameters, given a starting value for the first conditional variance. MLE is numerical (no closed form) but straightforward with modern optimisers. This is the standard way GARCH models are fitted in practice.
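
A minimal sketch of this procedure follows, fitted to simulated returns. The simulation parameters, the starting values, the box constraints, and the choice to initialise the recursion at the sample variance are all assumptions made for illustration. Production implementations typically add a stationarity constraint and analytic gradients.

```python
import numpy as np
from scipy.optimize import minimize

def simulate_garch(omega, alpha, beta, n, rng):
    """Simulate n returns from a gaussian GARCH(1,1)."""
    r = np.empty(n)
    s2 = omega / (1.0 - alpha - beta)   # start at the unconditional variance
    for t in range(n):
        r[t] = np.sqrt(s2) * rng.standard_normal()
        s2 = omega + alpha * r[t] ** 2 + beta * s2
    return r

def garch_neg_loglik(params, r):
    """Negative log-likelihood, dropping the additive constant."""
    omega, alpha, beta = params
    s2 = np.empty_like(r)
    s2[0] = r.var()                     # assumed initialisation of the recursion
    for t in range(1, r.size):
        s2[t] = omega + alpha * r[t - 1] ** 2 + beta * s2[t - 1]
    return 0.5 * np.sum(np.log(s2) + r ** 2 / s2)

rng = np.random.default_rng(4)
r = simulate_garch(omega=0.05, alpha=0.10, beta=0.85, n=3000, rng=rng)

x0 = np.array([0.1, 0.05, 0.8])         # rough starting guess
start_nll = garch_neg_loglik(x0, r)
res = minimize(garch_neg_loglik, x0=x0, args=(r,), method="L-BFGS-B",
               bounds=[(1e-6, None), (0.0, 1.0), (0.0, 1.0)])
omega_hat, alpha_hat, beta_hat = res.x

print(f"omega={omega_hat:.3f}, alpha={alpha_hat:.3f}, beta={beta_hat:.3f}")
```

The bounds keep every conditional variance strictly positive, so the logarithm is always defined. Estimates of $\omega$ tend to be noisy even with a few thousand observations, while the persistence $\alpha + \beta$ is usually pinned down more tightly.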

Numerical MLE: practical notes

When the MLE has no closed form:

  1. Maximise $\ell(\theta)$ numerically. Newton-Raphson, BFGS, L-BFGS-B are standard.
  2. Use the analytic gradient and Hessian if available. Supplying the score $\ell'(\theta)$ and Hessian $\ell''(\theta)$ in closed form accelerates convergence dramatically.
  3. Start from a good initial guess. Moment matching or profile likelihoods give reasonable starting points.
  4. Check multiple starts. Non-convex likelihoods (e.g. Heston) have local maxima; always try multiple initialisations.
  5. Estimate $I(\hat\theta)^{-1}$ from the numerical Hessian. With $\ell$ the full-sample log-likelihood, $-\ell''(\hat\theta)/n$ estimates $I(\hat\theta)$, so the standard errors are $\sqrt{\text{diag}\big((-\ell''(\hat\theta))^{-1}\big)}$.
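
Step 5 can be sketched with a central finite-difference Hessian. The example below applies it to the normal model with parameters $(\mu, \sigma^2)$, where the answer is known in closed form: $\text{SE}(\hat\mu) = \hat\sigma/\sqrt n$ and $\text{SE}(\hat\sigma^2) = \hat\sigma^2\sqrt{2/n}$. The helper `numerical_hessian`, its step size, and the simulated data are assumptions of this sketch rather than a reference implementation.

```python
import numpy as np

def numerical_hessian(f, theta, rel_step=1e-4):
    """Central finite-difference Hessian of a scalar function f at theta."""
    theta = np.asarray(theta, dtype=float)
    k = theta.size
    h = rel_step * np.maximum(1.0, np.abs(theta))
    H = np.empty((k, k))
    for i in range(k):
        ei = np.zeros(k)
        ei[i] = h[i]
        for j in range(k):
            ej = np.zeros(k)
            ej[j] = h[j]
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * h[i] * h[j])
    return H

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=2.0, size=1000)   # simulated data
n = x.size

def loglik(theta):
    """Full-sample normal log-likelihood in (mu, sigma^2)."""
    mu, s2 = theta
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * np.sum((x - mu) ** 2) / s2

theta_hat = np.array([x.mean(), x.var(ddof=0)])   # closed-form MLE
H = numerical_hessian(loglik, theta_hat)
se_numeric = np.sqrt(np.diag(np.linalg.inv(-H)))  # sqrt of diag of inverse observed information

se_theory = np.array([np.sqrt(theta_hat[1] / n), theta_hat[1] * np.sqrt(2.0 / n)])
print(se_numeric, se_theory)
```

Because the Hessian here is that of the full-sample log-likelihood, no extra division by $n$ is needed. If the optimiser works with a transformed parameter such as $\log\sigma^2$, the standard errors refer to the transformed scale and need the delta method to convert back.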

Common pitfalls

"MLE is unbiased." No. The MLE is only asymptotically unbiased. The MLE for $\sigma^2$ is biased in finite samples; Bessel's correction fixes it.
"MLE gives the 'true' parameter." It gives the parameter that maximises likelihood under the assumed model. If the model is wrong (misspecified), MLE converges to the "pseudo-true" parameter minimising KL divergence from the true distribution, which is not necessarily meaningful.
"MLE always exists." Not always, and when it exists it need not solve the score equation. For $\text{Uniform}(0, \theta)$ the MLE is $\hat\theta = \max_i x_i$, which sits on the boundary of the admissible region (and is biased), so it cannot be found by setting $\ell'(\theta) = 0$. Some likelihoods are unbounded above; then the "MLE" is ill-defined.
"MLE is robust to outliers." Emphatically no. Squared-loss MLE (gaussian errors) is very sensitive to outliers. Robust estimators (M-estimators, Huber loss) trade efficiency for robustness.
"Higher likelihood = better model." Comparing likelihoods across models is only meaningful after penalising for model complexity (AIC, BIC). Overfitting produces arbitrarily high likelihoods on training data.
"Large Fisher information = good." Fisher information $I(\theta)$ quantifies how much the data tell you about $\theta$. Higher $I$ $\Rightarrow$ smaller standard errors (inverse Fisher). But $I$ depends on $\theta$ itself, so in some parameter regions the estimator is much more precise than in others.
