Maximum Likelihood Estimation

Motivation: why this matters in quant finance

Maximum Likelihood Estimation (MLE) is the default method for fitting probabilistic models to data in quant finance. Given a parametric model with unknown parameters $\theta$, MLE chooses $\hat\theta$ to maximise the probability of the observed data:

\hat\theta_{\text{MLE}} := \arg\max_\theta p(\text{data} \mid \theta).

In quant-finance practice, MLE underlies:

GARCH volatility estimation. The parameters of the ARCH/GARCH family are estimated by maximising the log-likelihood of daily returns.
Option-model calibration. Heston, SABR, and Bates parameters are typically calibrated to implied-volatility surfaces by least squares, which coincides with MLE when the fitting errors are assumed gaussian.
Credit-spread term structure. Merton models, CIR, and Hull-White parameters are MLE-calibrated to observed yield data.
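Regime-switching models. Hidden Markov model parameters are typically fitted with the EM algorithm, which maximises the likelihood iteratively by working with the complete-data log-likelihood (observations plus hidden states).
Kalman filter estimation. State-space parameters are estimated by maximising the likelihood built from the filter's one-step-ahead innovations.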
Factor-model identification. Principal-factor, CAPM, and Fama-French parameters are MLE estimates under gaussian errors (equivalent to OLS).

MLE has three attractive properties: consistency ($\hat\theta_{\text{MLE}} \to \theta_0$ in probability), asymptotic normality ($\sqrt n(\hat\theta - \theta_0) \Rightarrow \mathcal{N}(0, I(\theta_0)^{-1})$), and asymptotic efficiency (its asymptotic variance attains the Cramér-Rao lower bound). Under standard regularity conditions, no regular estimator has a smaller asymptotic variance than the MLE.

The informal idea

Given i.i.d. data $x_1, \ldots, x_n$ from a distribution with density $p(x; \theta)$, the likelihood function is:

L(\theta) := \prod_{i=1}^n p(x_i; \theta).

The likelihood is the probability (or density) of the observed data, viewed as a function of the parameter. MLE picks the $\theta$ that maximises it: "the value of $\theta$ most consistent with the data."

In practice we work with the log-likelihood $\ell(\theta) := \log L(\theta) = \sum_{i=1}^n \log p(x_i; \theta)$, since products of densities become sums of log-densities and differentiation is much cleaner. Because the logarithm is strictly increasing, $\ell$ and $L$ have the same maximiser.
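
A second, purely numerical reason to prefer the log-likelihood is floating-point underflow. The sketch below uses simulated standard-normal data (the sample size, seed, and helper names are illustrative assumptions) to show the raw product of densities collapsing to zero while the sum of log-densities stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)  # simulated data, illustrative only

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def likelihood(x, mu, sigma):
    # Direct product of n densities: underflows to 0.0 in double precision
    return np.prod(normal_pdf(x, mu, sigma))

def log_likelihood(x, mu, sigma):
    # Sum of log-densities: numerically well behaved
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2)

L = likelihood(x, 0.0, 1.0)
ell = log_likelihood(x, 0.0, 1.0)
print(L, ell)
```

With a few thousand observations the product is already below the smallest representable double, so any optimiser working on `L` directly would see a flat function.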

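When $\ell$ is smooth and the maximum lies in the interior of the parameter space, the MLE satisfies the first- and second-order conditions (written for scalar $\theta$):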

\ell'(\hat\theta) = 0, \quad \ell''(\hat\theta) < 0 \quad \text{(at a maximum)}.

Formal definition

Let $x_1, \ldots, x_n$ be i.i.d. samples from a distribution with density (or pmf) $p(\cdot; \theta)$ where $\theta \in \Theta \subseteq \mathbb{R}^k$ is unknown.

Likelihood function: $L: \Theta \to [0, \infty)$ , $L(\theta) := \prod_i p(x_i; \theta)$ .
Log-likelihood: $\ell: \Theta \to \mathbb{R}$ , $\ell(\theta) := \sum_i \log p(x_i; \theta)$ .
Maximum likelihood estimator (MLE): $\hat\theta_{\text{MLE}} := \arg\max_{\theta\in\Theta} \ell(\theta)$ .

Existence and uniqueness of the MLE depend on the model and data; in regular parametric families, the MLE exists a.s. for large $n$ and is unique under identifiability.

Standard properties under regularity

Under standard regularity conditions (identifiability, smoothness of the likelihood, interior maximum, correctly specified model, etc.):

Property 1 — Consistency

\hat\theta_{\text{MLE}} \xrightarrow{\mathbb{P}} \theta_0,

where $\theta_0$ is the true parameter. As sample size grows, the MLE converges in probability to the truth.

Property 2 — Asymptotic normality

\sqrt n\,(\hat\theta_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}(0, I(\theta_0)^{-1}),

where $I(\theta)$ is the Fisher information matrix (per observation):

I(\theta) := -\mathbb{E}_\theta[\ell_1''(\theta)] = \mathbb{E}_\theta[\ell_1'(\theta)\ell_1'(\theta)^\top],

with $\ell_1(\theta) := \log p(X_1; \theta)$ the log-density for a single observation.

Finite-sample practitioner's rule: $\hat\theta_{\text{MLE}} \approx \mathcal{N}(\theta_0, I(\hat\theta)^{-1}/n)$ , so standard errors are $\text{SE}(\hat\theta_j) \approx \sqrt{[I(\hat\theta)^{-1}]_{jj}/n}$ .
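
As a worked instance of this rule, take the exponential model $p(x; \lambda) = \lambda e^{-\lambda x}$: here $\ell_1''(\lambda) = -1/\lambda^2$, so $I(\lambda) = 1/\lambda^2$ and the rule gives $\text{SE}(\hat\lambda) \approx \hat\lambda/\sqrt n$. The following Monte Carlo sketch (true rate, sample size, and number of replications are arbitrary illustrative choices) compares that formula with the observed spread of the MLE across simulated samples.

```python
import numpy as np

rng = np.random.default_rng(42)
lam0, n, reps = 2.0, 400, 5000  # illustrative values, not from the text

# Each row is one simulated sample of size n from Exp(lam0)
samples = rng.exponential(scale=1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / samples.mean(axis=1)     # MLE of the rate for each sample

se_fisher = lam0 / np.sqrt(n)            # sqrt(I(lam0)^{-1} / n)
se_empirical = lam_hat.std(ddof=1)       # spread of the MLE across samples
se_plugin = lam_hat[0] / np.sqrt(n)      # what one would report from a single sample

print(se_fisher, se_empirical, se_plugin)
```

The three numbers should be close to one another; the plug-in version is the one available in practice, since $\lambda_0$ is unknown.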

Property 3 — Asymptotic efficiency

The asymptotic variance $I(\theta_0)^{-1}/n$ matches the Cramér-Rao lower bound for unbiased estimators. No unbiased estimator can do better asymptotically.

Property 4 — Invariance

For any function $g$ : if $\hat\theta$ is the MLE of $\theta$ , then $g(\hat\theta)$ is the MLE of $g(\theta)$ . (Cheap but powerful — e.g. if $\hat\sigma$ is the MLE of volatility, then $\hat\sigma^2$ is the MLE of variance.)
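
Invariance can be checked numerically. In this sketch (simulated gaussian data; the mean is fixed at the sample mean and all names are illustrative assumptions) the likelihood is maximised directly over $\sigma$ and the result is compared with the square root of the closed-form variance MLE.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
x = rng.normal(loc=0.0, scale=1.5, size=500)
n = x.size
ss = np.sum((x - x.mean()) ** 2)

sigma2_hat = ss / n                      # closed-form MLE of the variance

def neg_log_lik_sigma(sigma):
    # Negative gaussian log-likelihood as a function of sigma (constants dropped)
    return n * np.log(sigma) + ss / (2.0 * sigma**2)

res = minimize_scalar(neg_log_lik_sigma, bounds=(1e-3, 10.0), method="bounded")
sigma_hat_direct = res.x                 # MLE found by optimising over sigma itself

print(sigma_hat_direct, np.sqrt(sigma2_hat))
```

Up to optimiser tolerance the two values coincide, which is what the invariance property predicts for $g(\sigma^2) = \sqrt{\sigma^2}$.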

Canonical examples

Example 1 — Mean of a normal with known variance

$X_1, \ldots, X_n$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ , $\sigma^2$ known.

\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2.

$\ell'(\mu) = (1/\sigma^2)\sum(x_i - \mu) = 0 \Rightarrow \hat\mu = \bar x$ . Sample mean is the MLE.

Example 2 — Both mean and variance of a normal

$X_i$ i.i.d. $\mathcal{N}(\mu, \sigma^2)$ with both unknown. Log-likelihood:

\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2.

Setting $\partial\ell/\partial\mu = 0$ : $\hat\mu = \bar x$ . Setting $\partial\ell/\partial\sigma^2 = 0$ :

-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum(x_i - \bar x)^2 = 0 \Rightarrow \hat\sigma^2 = \frac{1}{n}\sum(x_i - \bar x)^2.

Caveat: $\hat\sigma^2$ is biased downward: $\mathbb{E}[\hat\sigma^2] = (n-1)\sigma^2/n$. The unbiased (Bessel-corrected) version divides by $n - 1$; the MLE divides by $n$. Bessel's correction is standard in practice when unbiasedness is the priority, although under normality the MLE has the smaller mean squared error. The two estimators agree as $n \to \infty$.
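
The bias is easy to see by simulation. This sketch draws many small gaussian samples (sample size, variance, and replication count are illustrative assumptions) and averages the two variance estimators. It also reports their mean squared errors, a comparison added here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, sigma2 = 5, 200_000, 4.0        # illustrative values

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
var_mle = samples.var(axis=1, ddof=0)        # divides by n
var_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1

mean_mle = var_mle.mean()                # theory: (n - 1) / n * sigma2 = 3.2
mean_unbiased = var_unbiased.mean()      # theory: sigma2 = 4.0

mse_mle = np.mean((var_mle - sigma2) ** 2)
mse_unbiased = np.mean((var_unbiased - sigma2) ** 2)

print(mean_mle, mean_unbiased, mse_mle, mse_unbiased)
```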

Example 3 — Exponential rate

$X_i$ i.i.d. $\text{Exp}(\lambda)$ : $p(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$ .

\ell(\lambda) = n\log\lambda - \lambda\sum x_i.

$\ell'(\lambda) = n/\lambda - \sum x_i = 0 \Rightarrow \hat\lambda = n/\sum x_i = 1/\bar x$ . Reciprocal of the sample mean.
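
A quick sanity check of the closed form, as a sketch on simulated data (the true rate, seed, and search bounds are arbitrary assumptions): minimise the negative log-likelihood numerically and compare with $1/\bar x$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0 / 0.5, size=1000)   # true rate 0.5, illustrative

def neg_log_lik(lam):
    # -(n log(lambda) - lambda * sum(x))
    return -(x.size * np.log(lam) - lam * x.sum())

lam_closed = 1.0 / x.mean()
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
lam_numeric = res.x

print(lam_closed, lam_numeric)
```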

Example 4 — Linear regression is MLE under gaussian errors

With $y = X\beta + \epsilon$ , $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ :

\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|y - X\beta\|^2.

Maximising over $\beta$ with $\sigma^2$ fixed: minimise $\|y - X\beta\|^2$ — OLS. So $\hat\beta_{\text{MLE}} = \hat\beta_{\text{OLS}} = (X^\top X)^{-1}X^\top y$ . Under gaussian errors, OLS is the MLE.
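
The equivalence can be illustrated numerically. The sketch below simulates a small regression (design, coefficients, and noise level are made-up illustrative values), fits it once by least squares and once by maximising the gaussian log-likelihood over $(\beta, \log\sigma^2)$ with a generic optimiser. The log-variance parametrisation is a convenience assumption to keep $\sigma^2$ positive.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, -2.0, 0.5])            # illustrative coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

def neg_log_lik(params):
    beta, log_sigma2 = params[:-1], params[-1]
    sigma2 = np.exp(log_sigma2)
    resid = y - X @ beta
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

# Ordinary least squares
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Generic numerical MLE, started from beta = 0 and the sample variance of y
x0 = np.append(np.zeros(k), np.log(y.var()))
res = minimize(neg_log_lik, x0, method="BFGS")
beta_mle = res.x[:-1]
sigma2_mle = np.exp(res.x[-1])

rss = np.sum((y - X @ beta_ols) ** 2)
print(beta_ols, beta_mle, sigma2_mle, rss / n)
```

The coefficient vectors agree to optimiser tolerance, and the fitted variance matches $\text{RSS}/n$, the MLE of $\sigma^2$ (not the $n - k$ version).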

Example 5 — GARCH(1,1) via MLE

For $r_t | \mathcal{F}_{t-1} \sim \mathcal{N}(0, \sigma_t^2)$ with $\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta\sigma_{t-1}^2$ :

\ell(\omega, \alpha, \beta) = -\sum_t \left[\tfrac{1}{2}\log\sigma_t^2(\theta) + \tfrac{r_t^2}{2\sigma_t^2(\theta)}\right].

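Here $\theta := (\omega, \alpha, \beta)$, and the additive constant $-\tfrac{1}{2}\log(2\pi)$ per observation is omitted because it does not affect the maximiser. The $\sigma_t^2(\theta)$ are computed recursively from the data and parameters, starting from an initial value (commonly the sample variance). MLE is numerical (no closed form) but straightforward with modern optimisers. This is the standard way GARCH models are fitted in practice.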
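
A minimal sketch of this procedure on simulated data follows. The parameter values, the initialisation of the variance recursion at the sample variance, the starting point, and the box bounds are all assumptions made for illustration; production code would typically also handle the stationarity constraint $\alpha + \beta < 1$ and use a dedicated library.

```python
import numpy as np
from scipy.optimize import minimize

def simulate_garch(T, omega, alpha, beta, rng):
    """Simulate T returns from a gaussian GARCH(1,1)."""
    r = np.empty(T)
    s2 = omega / (1.0 - alpha - beta)      # start at the unconditional variance
    for t in range(T):
        r[t] = np.sqrt(s2) * rng.standard_normal()
        s2 = omega + alpha * r[t] ** 2 + beta * s2
    return r

def garch_neg_log_lik(theta, r):
    """Negative log-likelihood, dropping the constant 0.5*log(2*pi) per observation."""
    omega, alpha, beta = theta
    s2 = r.var()                           # assumption: initial variance = sample variance
    nll = 0.0
    for rt in r:
        nll += 0.5 * (np.log(s2) + rt**2 / s2)
        s2 = omega + alpha * rt**2 + beta * s2   # variance for the next observation
    return nll

rng = np.random.default_rng(10)
theta_true = np.array([0.05, 0.10, 0.85])  # illustrative, returns measured in percent
r = simulate_garch(3000, *theta_true, rng)

theta0 = np.array([0.10, 0.10, 0.80])      # hypothetical starting point
bounds = [(1e-8, None), (0.0, 1.0), (0.0, 1.0)]
res = minimize(garch_neg_log_lik, theta0, args=(r,), method="L-BFGS-B", bounds=bounds)
omega_hat, alpha_hat, beta_hat = res.x

print(res.x, alpha_hat + beta_hat)
```

The Python loop is kept for readability; it evaluates the same recursion as the formula above, one observation at a time.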

Numerical MLE: practical notes

When the MLE has no closed form:

Maximise $\ell(\theta)$ numerically. Newton-Raphson, BFGS, L-BFGS-B are standard.
Use the analytic gradient and Hessian if available. The score $\ell'(\theta)$ and Hessian $\ell''(\theta)$ accelerate convergence dramatically.
Start from a good initial guess. Moment matching or profile likelihoods give reasonable starting points.
Check multiple starts. Non-concave log-likelihoods (e.g. Heston) have local maxima; always try multiple initialisations.
Estimate the covariance of $\hat\theta$ from the numerical Hessian of the full log-likelihood: $(-\ell''(\hat\theta))^{-1} \approx I(\hat\theta)^{-1}/n$. Standard errors are $\sqrt{\text{diag}\big((-\ell''(\hat\theta))^{-1}\big)}$.
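
The notes above can be sketched end to end on a model with no closed-form MLE. The example fits a gamma distribution (an illustrative choice, not one of the models discussed in the text) from several starting points, one of them moment-matched, and derives standard errors from a finite-difference Hessian. The helper `numerical_hessian` and its step size are assumptions of this sketch.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.gamma(shape=3.0, scale=2.0, size=2000)    # simulated data, illustrative

def neg_log_lik(theta):
    k, scale = theta
    return -np.sum(stats.gamma.logpdf(x, a=k, scale=scale))

def numerical_hessian(f, theta, rel_step=1e-4):
    """Central-difference Hessian of f at theta (hypothetical helper)."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    h = rel_step * np.maximum(np.abs(theta), 1.0)
    H = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.zeros(p); ei[i] = h[i]
            ej = np.zeros(p); ej[j] = h[j]
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4.0 * h[i] * h[j])
    return H

# Starting points: moment matching first, then two arbitrary alternatives
m, v = x.mean(), x.var()
starts = [np.array([m**2 / v, v / m]), np.array([1.0, 1.0]), np.array([5.0, 0.5])]
fits = [minimize(neg_log_lik, s, method="L-BFGS-B", bounds=[(1e-6, None)] * 2)
        for s in starts]
best = min(fits, key=lambda fit: fit.fun)         # keep the highest likelihood
theta_hat = best.x

# Hessian of the NEGATIVE log-likelihood equals -ell''(theta_hat)
H = numerical_hessian(neg_log_lik, theta_hat)
cov = np.linalg.inv(H)
se = np.sqrt(np.diag(cov))

print(theta_hat, se)
```

Comparing `fit.fun` across starts is a cheap way to detect local maxima: if the values differ materially, the likelihood surface is multi-modal and the best one should be reported.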

Common pitfalls

"MLE is unbiased." No: the MLE is only asymptotically unbiased. The MLE for $\sigma^2$ is biased in finite samples; Bessel's correction fixes it.

"MLE gives the 'true' parameter." It gives the parameter that maximises likelihood under the assumed model. If the model is wrong (misspecified), MLE converges to the "pseudo-true" parameter minimising KL divergence from the true distribution — not necessarily meaningful.

"MLE always exists." Counter-examples: uniform

\text{Uniform}(0, \theta)

has MLE

\hat\theta = \max_i x_i

, which is on the boundary (and biased). Some likelihoods are unbounded above — then "MLE" is ill-defined.
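
The bias of the $\text{Uniform}(0, \theta)$ estimator can be checked by simulation. In this sketch (sample size and replication count are illustrative assumptions) the sample maximum averages $n\theta/(n+1)$, and rescaling by $(n+1)/n$ removes the bias.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 1.0, 10, 100_000        # illustrative values

samples = rng.uniform(0.0, theta, size=(reps, n))
theta_mle = samples.max(axis=1)                  # MLE: the sample maximum
theta_corrected = (n + 1) / n * theta_mle        # bias-corrected version

mean_mle = theta_mle.mean()              # theory: n / (n + 1) * theta
mean_corrected = theta_corrected.mean()  # theory: theta

print(mean_mle, mean_corrected)
```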

"MLE is robust to outliers." Emphatically no. Squared-loss MLE (gaussian errors) is very sensitive to outliers. Robust estimators (M-estimators, Huber loss) trade efficiency for robustness.

"Higher likelihood = better model." Comparing likelihoods across models is only meaningful after penalising for model complexity (AIC, BIC). Overfitting produces arbitrarily high likelihoods on training data.

"Large Fisher information = good." Fisher information

I(\theta)

quantifies how much the data tells you about

\theta

. Higher

I \Rightarrow

smaller standard errors (inverse Fisher). But

I

depends on

\theta

itself, so in some parameter regions the estimator is much more precise than in others.

Where this goes next

Linear Regression Derivation: OLS = MLE under gaussian errors.
Moment Generating Functions: Used in Cramér-Rao bound derivations.
Central Limit Theorem: Powers MLE's asymptotic normality.
Hypothesis Testing: Likelihood-ratio tests are built on MLE comparisons (future lesson).
GARCH Models: The most prominent MLE application in quant finance (future lesson).