Maximum Likelihood Estimation (MLE) is the default method for fitting probabilistic models to data in quant finance. Given a parametric model with unknown parameters $\theta$, MLE chooses $\hat\theta$ to maximise the probability (or density) of the observed data:
$$\hat\theta_{\mathrm{MLE}} := \arg\max_{\theta}\, p(\text{data} \mid \theta).$$
In quant-finance practice, MLE underlies:
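GARCH volatility estimation. ARCH/GARCH-family parameters are estimated by maximising the log-likelihood of daily returns.
Option-model calibration. Heston, SABR, and Bates parameters can be estimated by MLE on historical implied-volatility surfaces; the common least-squares fit to quoted prices or implied volatilities is itself MLE under gaussian pricing errors.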
Credit-spread term structure. Merton models, CIR, and Hull-White parameters are MLE-calibrated to observed yield data.
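Regime-switching models. Hidden Markov model parameters are usually fitted with EM, an iterative algorithm that climbs the likelihood by treating the hidden regimes as missing data.
Kalman filter estimation. State-space parameters are estimated by maximising the likelihood built from the filter's one-step-ahead prediction errors (innovations).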
Factor-model identification. Principal-factor, CAPM, and Fama-French parameters are MLE estimates under gaussian errors (equivalent to OLS).
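MLE has three attractive properties: consistency ($\hat\theta_{\mathrm{MLE}} \to \theta_0$ in probability), asymptotic normality ($\sqrt{n}\,(\hat\theta - \theta_0) \Rightarrow N(0, I(\theta_0)^{-1})$), and asymptotic efficiency (its asymptotic variance attains the Cramér-Rao lower bound). Under standard regularity conditions, no regular estimator has a smaller asymptotic variance than the MLE.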
The informal idea
Given i.i.d. data $x_1, \dots, x_n$ from a distribution with density $p(x; \theta)$, the likelihood function is:
$$L(\theta) := \prod_{i=1}^{n} p(x_i; \theta).$$
The likelihood is the probability (or density) of observing the data as a function of the parameter. MLE picks the θ that maximises it — "the value of θ most consistent with the data."
In practice we work with the log-likelihood $\ell(\theta) := \log L(\theta) = \sum_{i=1}^{n} \log p(x_i; \theta)$, since products of densities become sums of log-densities and differentiation is much cleaner.
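When $\ell$ is twice differentiable and the maximum lies in the interior of the parameter space, the MLE satisfies:
$$\ell'(\hat\theta) = 0, \qquad \ell''(\hat\theta) < 0 \quad \text{(at a maximum)}.$$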
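A minimal sketch of why the log-likelihood is also preferred numerically, assuming standard-normal data and a hypothetical helper `normal_logpdf` (neither comes from the text): the raw product of densities underflows to zero for a couple of thousand observations, while the sum of log-densities stays well scaled.

```python
import math
import random

def normal_logpdf(x, mu=0.0, sigma=1.0):
    """Log-density of N(mu, sigma^2) at x (illustrative helper)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Raw likelihood: a product of 2000 densities, each below 0.4,
# is far smaller than the smallest positive float and underflows.
likelihood = math.prod(math.exp(normal_logpdf(x)) for x in data)

# Log-likelihood: the same information as a finite sum.
log_likelihood = sum(normal_logpdf(x) for x in data)

print(likelihood, log_likelihood)
```

Any optimiser working on `likelihood` directly would see a flat zero surface here, which is one practical reason numerical MLE is always done on the log scale.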
Formal definition
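Let $x_1, \dots, x_n$ be i.i.d. samples from a distribution with density (or pmf) $p(\cdot\,; \theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^k$ is unknown.
Likelihood function: $L : \Theta \to [0, \infty)$, $L(\theta) := \prod_{i} p(x_i; \theta)$.
Log-likelihood: $\ell : \Theta \to \mathbb{R}$, $\ell(\theta) := \sum_{i} \log p(x_i; \theta)$.
Maximum likelihood estimator (MLE): $\hat\theta_{\mathrm{MLE}} := \arg\max_{\theta \in \Theta} \ell(\theta)$.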
Existence and uniqueness of the MLE depend on the model and data; in regular parametric families, the MLE exists a.s. for large n and is unique under identifiability.
Standard properties under regularity
Under standard regularity conditions (identifiability, smoothness of the likelihood, interior maximum, correctly specified model, etc.):
Property 1 — Consistency
$$\hat\theta_{\mathrm{MLE}} \xrightarrow{P} \theta_0,$$
where $\theta_0$ is the true parameter. As sample size grows, the MLE converges in probability to the truth.
Property 2 — Asymptotic normality
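$$\sqrt{n}\,(\hat\theta_{\mathrm{MLE}} - \theta_0) \xrightarrow{d} N\big(0, I(\theta_0)^{-1}\big),$$
where $I(\theta)$ is the Fisher information matrix (per observation):
$$I(\theta) := -\mathbb{E}_\theta\big[\ell_1''(\theta)\big] = \mathbb{E}_\theta\big[\ell_1'(\theta)\,\ell_1'(\theta)^\top\big],$$
with $\ell_1(\theta) := \log p(X_1; \theta)$ the log-density for a single observation.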
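Practitioner's rule (the large-sample approximation applied to a finite sample): $\hat\theta_{\mathrm{MLE}}$ is approximately $N(\theta_0, I(\hat\theta)^{-1}/n)$, so standard errors are $\mathrm{SE}(\hat\theta_j) \approx \sqrt{[I(\hat\theta)^{-1}]_{jj}/n}$.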
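The standard-error rule can be checked by simulation. The sketch below assumes a Bernoulli model, chosen here only for illustration (it is not one of the document's examples), where the MLE is the sample proportion and the per-observation Fisher information is $1/(p(1-p))$. The names `p_true`, `fisher_info`, and the sample sizes are all made up for the demonstration.

```python
import math
import random
import statistics

random.seed(1)
p_true = 0.3   # assumed true success probability
n = 200        # observations per simulated sample
reps = 2000    # number of simulated samples

def fisher_info(p):
    """Per-observation Fisher information of a Bernoulli(p) model."""
    return 1.0 / (p * (1.0 - p))

# MLE of p in each simulated sample: the fraction of successes.
estimates = []
for _rep in range(reps):
    successes = sum(1 for _obs in range(n) if random.random() < p_true)
    estimates.append(successes / n)

# Asymptotic rule: SE = sqrt(I(p)^{-1} / n).
asymptotic_se = math.sqrt(1.0 / (fisher_info(p_true) * n))
# Spread of the MLE actually observed across simulated samples.
empirical_se = statistics.stdev(estimates)

print(asymptotic_se, empirical_se)
```

The two numbers should agree to within Monte Carlo noise; in real work $\theta_0$ is unknown, so $I$ is evaluated at $\hat\theta$ instead.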
Property 3 — Asymptotic efficiency
The asymptotic variance $I(\theta_0)^{-1}/n$ matches the Cramér-Rao lower bound for unbiased estimators, so no unbiased estimator can do better asymptotically.
Property 4 — Invariance
For any function $g$: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$. (Cheap but powerful: for example, if $\hat\sigma$ is the MLE of volatility, then $\hat\sigma^2$ is the MLE of variance.)
Canonical examples
Example 1 — Mean of a normal with known variance
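$X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$, $\sigma^2$ known.
$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
$\ell'(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \Rightarrow \hat\mu = \bar{x}$. The sample mean is the MLE.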
Example 2 — Both mean and variance of a normal
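$X_i$ i.i.d. $N(\mu, \sigma^2)$ with both unknown. Log-likelihood:
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Setting $\partial\ell/\partial\mu = 0$: $\hat\mu = \bar{x}$. Setting $\partial\ell/\partial\sigma^2 = 0$:
$$-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 0 \;\Rightarrow\; \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$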
Caveat: $\hat\sigma^2$ is biased downward, with $\mathbb{E}[\hat\sigma^2] = (n-1)\sigma^2/n$. The unbiased (Bessel-corrected) version divides by $n-1$; the MLE divides by $n$. Bessel's correction is standard in practice because unbiasedness is usually preferred in small samples, and the two estimators agree as $n \to \infty$.
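The downward bias is easy to see in a small simulation. This is a sketch under made-up settings (true variance 4, samples of size 5, 20000 replications); the variable names are illustrative only.

```python
import random

random.seed(2)
mu, sigma2 = 0.0, 4.0   # assumed true mean and variance
n = 5                   # deliberately small sample size
reps = 20000

mle_vals, bessel_vals = [], []
for _ in range(reps):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)   # sum of squared deviations
    mle_vals.append(ss / n)                  # MLE: divide by n
    bessel_vals.append(ss / (n - 1))         # Bessel-corrected: divide by n - 1

mean_mle = sum(mle_vals) / reps        # theory: (n - 1) / n * sigma2 = 3.2
mean_bessel = sum(bessel_vals) / reps  # theory: sigma2 = 4.0

print(mean_mle, mean_bessel)
```

With $n = 5$ the MLE underestimates the variance by 20% on average; the gap shrinks like $1/n$ as the sample grows.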
Example 3 — Exponential rate
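$X_i$ i.i.d. $\mathrm{Exp}(\lambda)$: $p(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$.
$$\ell(\lambda) = n\log\lambda - \lambda\sum_{i=1}^{n} x_i.$$
$\ell'(\lambda) = n/\lambda - \sum_i x_i = 0 \Rightarrow \hat\lambda = n/\sum_i x_i = 1/\bar{x}$. The MLE is the reciprocal of the sample mean.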
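A short worked check of this example, using only formulas already given (the numbers $n = 100$ and $\bar{x} = 0.5$ are made up for illustration). The second derivative confirms that the stationary point is a maximum, and it also gives the Fisher information and the standard error from Property 2:

$$\ell''(\lambda) = -\frac{n}{\lambda^2} < 0, \qquad I(\lambda) = -\mathbb{E}\big[\ell_1''(\lambda)\big] = \frac{1}{\lambda^2}, \qquad \mathrm{SE}(\hat\lambda) \approx \sqrt{\frac{I(\hat\lambda)^{-1}}{n}} = \frac{\hat\lambda}{\sqrt{n}}.$$

With $n = 100$ and $\bar{x} = 0.5$ this gives $\hat\lambda = 2$ and $\mathrm{SE}(\hat\lambda) \approx 0.2$.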
Example 4 — Linear regression is MLE under gaussian errors
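With $y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I)$:
$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\lVert y - X\beta \rVert^2.$$
Maximising over $\beta$ with $\sigma^2$ fixed is the same as minimising $\lVert y - X\beta \rVert^2$, which is OLS. So $\hat\beta_{\mathrm{MLE}} = \hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y$. Under gaussian errors, OLS is the MLE.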
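The equivalence can be verified numerically. The sketch below assumes the simplest case of one regressor plus an intercept, with simulated data (intercept 0.5, slope 2.0, noise standard deviation 0.3, all made up). It computes the closed-form OLS fit and checks that moving the coefficients away from it lowers the gaussian log-likelihood; `rss` and `gaussian_loglik` are illustrative helper names.

```python
import math
import random

random.seed(3)
n = 200
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Simulated responses under the assumed true line and gaussian noise.
y = [0.5 + 2.0 * xi + random.gauss(0.0, 0.3) for xi in x]

# Closed-form OLS for a single regressor with intercept.
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = ybar - slope * xbar

def rss(a, b):
    """Residual sum of squares for the line y = a + b * x."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def gaussian_loglik(a, b, sigma2):
    """Gaussian regression log-likelihood, same form as in the text."""
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - rss(a, b) / (2.0 * sigma2)

sigma2_mle = rss(intercept, slope) / n   # MLE of the error variance: RSS / n
ll_at_ols = gaussian_loglik(intercept, slope, sigma2_mle)
# Any other coefficient pair has a larger RSS, hence a lower likelihood.
ll_perturbed = gaussian_loglik(intercept + 0.05, slope - 0.05, sigma2_mle)

print(slope, intercept, ll_at_ols, ll_perturbed)
```

Note that the variance MLE here divides the residual sum of squares by $n$, so it carries the same small-sample bias as in Example 2.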
Example 5 — GARCH(1,1) via MLE
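For $r_t \mid \mathcal{F}_{t-1} \sim N(0, \sigma_t^2)$ with $\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2$, the log-likelihood is, up to an additive constant:
$$\ell(\omega, \alpha, \beta) = -\sum_{t}\left[\frac{1}{2}\log\sigma_t^2(\theta) + \frac{r_t^2}{2\sigma_t^2(\theta)}\right].$$
The $\sigma_t^2(\theta)$, with $\theta = (\omega, \alpha, \beta)$, are computed recursively from the data and parameters. The MLE has no closed form and must be found numerically, but this is straightforward with modern optimisers. This is the standard way GARCH models are fitted in practice, including in commercial risk systems.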
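The recursion is short enough to write out. The following is a sketch, not a production fitter: it assumes the variance recursion is initialised at the sample variance (the text does not specify an initial value), keeps the $\log 2\pi$ constant, and uses made-up parameters for a simulated series. `garch_loglik` and `simulate_garch` are hypothetical helper names.

```python
import math
import random

def garch_loglik(params, returns):
    """Gaussian GARCH(1,1) log-likelihood, including the constant term.

    Assumption of this sketch: sigma_1^2 is set to the sample variance.
    """
    omega, alpha, beta = params
    var_t = sum(r * r for r in returns) / len(returns)
    loglik = 0.0
    for r in returns:
        loglik += -0.5 * (math.log(2.0 * math.pi) + math.log(var_t) + r * r / var_t)
        var_t = omega + alpha * r * r + beta * var_t   # next conditional variance
    return loglik

def simulate_garch(params, n, seed=0):
    """Simulate n returns from a gaussian GARCH(1,1) (illustrative helper)."""
    omega, alpha, beta = params
    rng = random.Random(seed)
    var_t = omega / (1.0 - alpha - beta)   # start at the unconditional variance
    out = []
    for _ in range(n):
        r = math.sqrt(var_t) * rng.gauss(0.0, 1.0)
        out.append(r)
        var_t = omega + alpha * r * r + beta * var_t
    return out

true_params = (0.05, 0.10, 0.85)   # assumed values for the simulation
returns = simulate_garch(true_params, n=3000, seed=4)

ll_true = garch_loglik(true_params, returns)
# Constant-variance benchmark: alpha = beta = 0, omega = sample variance.
sample_var = sum(r * r for r in returns) / len(returns)
ll_const = garch_loglik((sample_var, 0.0, 0.0), returns)

print(ll_true, ll_const)
```

On data with genuine volatility clustering the GARCH parameters should score a clearly higher log-likelihood than the constant-variance benchmark. An actual fit would hand the negative of `garch_loglik` to a bounded numerical optimiser, typically with $\omega > 0$, $\alpha, \beta \ge 0$, and $\alpha + \beta < 1$ imposed; those constraints are a common convention and an assumption here, not something the text states.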
Numerical MLE: practical notes
When the MLE has no closed form:
Maximise $\ell(\theta)$ numerically. Newton-Raphson, BFGS, and L-BFGS-B are standard.
Use analytic gradients and Hessians if available. Supplying the score $\ell'(\theta)$ and Hessian $\ell''(\theta)$ typically speeds up convergence considerably.
Start from a good initial guess. Moment matching or profile likelihoods give reasonable starting points.
Check multiple starts. Non-convex likelihoods (e.g. Heston) have local maxima; always try multiple initialisations.
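Estimate standard errors from the numerical Hessian. The observed information $-\ell''(\hat\theta)$ estimates $n\,I(\hat\theta)$, so $\mathrm{SE}(\hat\theta_j) \approx \sqrt{\big[(-\ell''(\hat\theta))^{-1}\big]_{jj}}$.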
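The steps above can be sketched end to end on a one-parameter problem. The example assumes a Cauchy location model with unit scale, picked here only because its MLE has no closed form and its likelihood can have several local maxima; it is not one of the document's models. Because the Cauchy has no moments, the sample median stands in for a moment-matching start. All function names are illustrative.

```python
import math
import random
import statistics

def loglik(mu, data):
    """Log-likelihood of a Cauchy(mu, 1) location model."""
    return -sum(math.log(math.pi * (1.0 + (x - mu) ** 2)) for x in data)

def score(mu, data):
    """Analytic first derivative of loglik with respect to mu."""
    return sum(2.0 * (x - mu) / (1.0 + (x - mu) ** 2) for x in data)

def hessian(mu, data):
    """Analytic second derivative of loglik with respect to mu."""
    return sum(2.0 * ((x - mu) ** 2 - 1.0) / (1.0 + (x - mu) ** 2) ** 2 for x in data)

def newton_mle(start, data, tol=1e-10, max_iter=100):
    """Newton-Raphson on the score; returns None if it does not converge."""
    mu = start
    for _ in range(max_iter):
        h = hessian(mu, data)
        if h >= 0.0:          # not locally concave: abandon this start
            return None
        step = score(mu, data) / h
        mu -= step
        if abs(step) < tol:
            return mu
    return None

random.seed(5)
true_loc = 1.0   # assumed true location for the simulation
data = [true_loc + math.tan(math.pi * (random.random() - 0.5)) for _ in range(500)]

# Several starts around a robust initial guess; keep the best converged one.
start0 = statistics.median(data)
starts = [start0 - 1.0, start0, start0 + 1.0]
candidates = [m for m in (newton_mle(s, data) for s in starts) if m is not None]
mu_hat = max(candidates, key=lambda m: loglik(m, data))

# Standard error from the observed information -l''(mu_hat).
se = math.sqrt(-1.0 / hessian(mu_hat, data))

print(mu_hat, se)
```

In several dimensions the same pattern applies with a gradient vector and Hessian matrix, and in practice a library quasi-Newton routine usually replaces the hand-written Newton loop.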
Common pitfalls
"MLE is unbiased." No: unbiasedness is guaranteed only asymptotically. The MLE for $\sigma^2$ is biased in finite samples; Bessel's correction fixes it.
"MLE gives the 'true' parameter." It gives the parameter that maximises likelihood under the assumed model. If the model is wrong (misspecified), MLE converges to the "pseudo-true" parameter minimising KL divergence from the true distribution, which is not necessarily meaningful.
"MLE always exists." Not always, and even when it exists it need not be regular. For $\mathrm{Uniform}(0, \theta)$ the MLE is $\hat\theta = \max_i x_i$, which exists but sits on the boundary of the support, is biased downward, and does not follow the usual asymptotic normality. Some likelihoods are unbounded above (for example, gaussian mixtures with unrestricted component variances); then the "MLE" is ill-defined.
"MLE is robust to outliers." Emphatically no. Squared-loss MLE (gaussian errors) is very sensitive to outliers. Robust estimators (M-estimators, Huber loss) trade efficiency for robustness.
"Higher likelihood = better model." Comparing likelihoods across models is only meaningful after penalising for model complexity (AIC, BIC). Overfitting produces arbitrarily high likelihoods on training data.
"Large Fisher information = good." Fisher information I(θ) quantifies how much the data tells you about θ. Higher I⇒ smaller standard errors (inverse Fisher). But I depends on θ itself, so in some parameter regions the estimator is much more precise than in others.