Backtesting Risk Models
Motivation: why this matters in quant finance
A Value-at-Risk number is a prediction. The model says "you will lose more than $\mathrm{VaR}_\alpha$ on at most a fraction $1-\alpha$ of days." Like any prediction, it deserves to be tested against history. If a 99%-VaR model produces seven exceedances in one year of trading days (about 250), far more than the two or three expected, it is miscalibrated and capital is being allocated on fiction.
This note builds the three core tests, explains what each catches and what each misses, and closes with the important subtlety that ES requires different backtesting technology.
The informal idea
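If the model is well-calibrated at confidence level $\alpha$, then the exceedance indicator $I_t = \mathbf{1}\{L_t > \mathrm{VaR}_{\alpha,t}\}$ (with $L_t$ the day-$t$ loss) is a Bernoulli random variable with parameter $p = 1-\alpha$, and over $T$ days the total count $N = \sum_{t=1}^{T} I_t$ is approximately $\mathrm{Binomial}(T, p)$. Three things can fail:
- Wrong count. $N$ is too high (model under-forecasts risk) or too low (over-forecasts, wastes capital).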
- Wrong timing. Exceedances cluster in time (after a risk event, more exceedances follow) — the model ignores regime changes.
- Wrong tail shape. The count is right but individual exceedances are much bigger than the model predicts.
The three tests address these in turn.
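Every test in this note starts from the same raw material: the sequence of daily hit indicators. A minimal sketch of that step, assuming losses are recorded as positive numbers and VaR forecasts are quoted as positive loss thresholds (the function name and the sample figures are hypothetical):

```python
from typing import List, Sequence


def exceedance_hits(losses: Sequence[float], var_forecasts: Sequence[float]) -> List[int]:
    """Return 1 for each day whose realised loss exceeds that day's VaR forecast, else 0."""
    if len(losses) != len(var_forecasts):
        raise ValueError("losses and var_forecasts must have the same length")
    return [1 if loss > var else 0 for loss, var in zip(losses, var_forecasts)]


# Hypothetical six-day sample: positive numbers are losses, negative are gains.
losses = [0.4, 2.7, -1.1, 3.2, 0.9, 2.1]
var_99 = [2.5, 2.5, 2.6, 2.6, 2.8, 2.8]

hits = exceedance_hits(losses, var_99)
n_exceedances = sum(hits)
print(hits, n_exceedances)  # [0, 1, 0, 1, 0, 0] 2
```

The count feeds the "wrong count" check, the ordering of the ones and zeros feeds the "wrong timing" check, and the sizes of the losses on hit days feed the "wrong tail shape" check.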
Formal definitions
Kupiec's Proportion of Failures (POF) test
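Under the null hypothesis that the model is correctly calibrated, the exceedance count over $T$ days satisfies $N \sim \mathrm{Binomial}(T, p)$ with $p = 1-\alpha$. Kupiec's test uses the likelihood ratio
$$
\mathrm{LR}_{\mathrm{POF}} = -2 \ln \frac{(1-p)^{T-N}\, p^{N}}{(1-\hat p)^{T-N}\, \hat p^{N}}, \qquad \hat p = \frac{N}{T}.
$$
Under the null, $\mathrm{LR}_{\mathrm{POF}} \sim \chi^2_1$ asymptotically. Reject if $\mathrm{LR}_{\mathrm{POF}}$ exceeds the $\chi^2_1$ critical value (e.g., $3.84$ for a 5% test).
Intuition: the test compares the likelihood of the observed data under the claimed $p$ vs the empirical MLE $\hat p = N/T$. If they are far apart, reject.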
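The POF statistic is short enough to compute with the standard library. The sketch below is one possible implementation, not a reference one: the function name is made up, and the p-value uses the identity that the $\chi^2_1$ survival function equals `erfc(sqrt(x/2))`. Zero counts are handled with the convention $0 \cdot \ln 0 = 0$.

```python
import math
from typing import Tuple


def kupiec_pof(n_exceed: int, n_days: int, p: float) -> Tuple[float, float]:
    """Return (LR statistic, asymptotic chi-square(1) p-value) for Kupiec's POF test."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 <= n_exceed <= n_days or n_days == 0:
        raise ValueError("need 0 <= n_exceed <= n_days and n_days > 0")

    def loglik(q: float) -> float:
        # Binomial log-likelihood without the combinatorial constant (it cancels in the ratio).
        ll = 0.0
        if n_exceed > 0:
            ll += n_exceed * math.log(q)
        if n_days - n_exceed > 0:
            ll += (n_days - n_exceed) * math.log(1.0 - q)
        return ll

    p_hat = n_exceed / n_days
    lr = -2.0 * (loglik(p) - loglik(p_hat))
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value


# Illustrative inputs: seven exceedances in 250 days against a 1% exceedance probability.
lr, p_value = kupiec_pof(7, 250, 0.01)
print(f"LR = {lr:.2f}, p-value = {p_value:.3f}")  # LR is about 5.50
```

With these inputs the statistic lies above 3.84 but below 6.63, so the test rejects at 5% and not at 1%.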
Christoffersen's independence test
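Even if the marginal hit rate is right, exceedances might cluster: one exceedance makes the next more likely. The test conditions each day's indicator $I_t$ on the previous day's and compares the transition probabilities $\pi_{01} = P(I_t = 1 \mid I_{t-1} = 0)$ and $\pi_{11} = P(I_t = 1 \mid I_{t-1} = 1)$. Under independence, $\pi_{01} = \pi_{11}$.
Define $n_{ij}$ = number of days $t$ with $I_{t-1} = i$, $I_t = j$. The likelihood ratio is
$$
\mathrm{LR}_{\mathrm{ind}} = -2 \ln \frac{(1-\hat\pi)^{n_{00}+n_{10}}\, \hat\pi^{\,n_{01}+n_{11}}}{(1-\hat\pi_{01})^{n_{00}}\, \hat\pi_{01}^{\,n_{01}}\, (1-\hat\pi_{11})^{n_{10}}\, \hat\pi_{11}^{\,n_{11}}}
$$
where $\hat\pi_{01} = n_{01}/(n_{00}+n_{01})$, $\hat\pi_{11} = n_{11}/(n_{10}+n_{11})$ and $\hat\pi = (n_{01}+n_{11})/(n_{00}+n_{01}+n_{10}+n_{11})$ are the empirical estimates. Asymptotically $\mathrm{LR}_{\mathrm{ind}} \sim \chi^2_1$.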
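The independence statistic only needs the four transition counts from a 0/1 hit sequence. The following is a sketch under the first-order Markov alternative described above; the function name and both sample sequences are hypothetical, and empty cells are treated with $0 \cdot \ln 0 = 0$.

```python
import math
from typing import Sequence, Tuple


def christoffersen_independence(hits: Sequence[int]) -> Tuple[float, float]:
    """Return (LR statistic, asymptotic chi-square(1) p-value) for the independence test."""
    if len(hits) < 2:
        raise ValueError("need at least two observations")

    # n[i][j] counts days with previous indicator i and current indicator j.
    n = [[0, 0], [0, 0]]
    for prev, curr in zip(hits[:-1], hits[1:]):
        n[prev][curr] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]

    def xlogy(count: int, prob: float) -> float:
        return count * math.log(prob) if count > 0 else 0.0

    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    pi01 = n01 / (n00 + n01) if (n00 + n01) > 0 else 0.0
    pi11 = n11 / (n10 + n11) if (n10 + n11) > 0 else 0.0

    ll_null = xlogy(n00 + n10, 1.0 - pi) + xlogy(n01 + n11, pi)
    ll_alt = (xlogy(n00, 1.0 - pi01) + xlogy(n01, pi01)
              + xlogy(n10, 1.0 - pi11) + xlogy(n11, pi11))
    lr = -2.0 * (ll_null - ll_alt)
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value


# Two hypothetical 65-day sequences with five exceedances each.
clustered = [0] * 20 + [1, 1, 1] + [0] * 20 + [1, 1] + [0] * 20
spread = [1 if t % 13 == 0 else 0 for t in range(65)]

lr_clustered, _ = christoffersen_independence(clustered)
lr_spread, _ = christoffersen_independence(spread)
print(f"clustered: {lr_clustered:.2f}, spread: {lr_spread:.2f}")
```

Both sequences have the same hit count, so a count-based test cannot tell them apart; only the clustered one produces a statistic above the 3.84 critical value.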
Basel traffic light
Over a window of 250 trading days (one year), the regulator counts exceedances at 99% VaR:
| Zone | Exceedances | Capital multiplier |
|---|---|---|
| Green | 0-4 | 3.0 (baseline) |
| Yellow | 5-9 | 3.40 to 3.85 (multiplier increases with count) |
| Red | 10+ | 4.0 (regulatory review) |
At 99% VaR over 250 days, expected exceedances = $250 \times 0.01 = 2.5$. Under the Binomial model, $P(N \ge 10) \approx 0.025\%$, so red is genuinely severe.
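The zone boundaries and the binomial tail probabilities behind them are easy to check numerically. A small sketch, assuming independent exceedances with probability 1% over 250 days (function names are made up; only the zone thresholds from the table are used):

```python
import math


def traffic_light_zone(n_exceed: int) -> str:
    """Map a 250-day exceedance count at 99% VaR to its Basel zone."""
    if n_exceed < 0:
        raise ValueError("exceedance count cannot be negative")
    if n_exceed <= 4:
        return "green"
    if n_exceed <= 9:
        return "yellow"
    return "red"


def binomial_tail(k: int, n_days: int = 250, p: float = 0.01) -> float:
    """P(N >= k) for N ~ Binomial(n_days, p), computed as 1 - P(N <= k - 1)."""
    below = sum(math.comb(n_days, j) * p**j * (1.0 - p) ** (n_days - j) for j in range(k))
    return 1.0 - below


expected = 250 * 0.01
p_leave_green = binomial_tail(5)   # a correct model lands in yellow or red
p_red = binomial_tail(10)          # a correct model lands in red
print(expected, round(p_leave_green, 4), f"{p_red:.5f}")
```

Under these assumptions a correctly calibrated model leaves the green zone roughly one year in nine, while reaching red by chance is a few-in-ten-thousand event.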
ES is not elicitable
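- VaR (a quantile) is elicitable: it minimises the expected value of a scoring function, so competing forecasts can be ranked by their average score. The scoring function is the pinball loss $S_\alpha(v, L) = (\mathbf{1}\{L \le v\} - \alpha)(v - L)$, where $v$ is the VaR forecast and $L$ the realised loss.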
- ES is not elicitable (Gneiting 2011) — no such scoring function exists for ES alone.
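Elicitability of VaR can be seen numerically: averaged over a sample, the pinball loss is smallest when the forecast sits at the sample quantile. The sketch below uses a deliberately artificial sample (losses 1 to 99) and a 90% level so the minimiser is unique; the names and data are assumptions for illustration only.

```python
from typing import Sequence


def pinball_loss(var_forecast: float, loss: float, alpha: float) -> float:
    """Pinball (quantile) score for a VaR forecast at confidence level alpha."""
    indicator = 1.0 if loss <= var_forecast else 0.0
    return (indicator - alpha) * (var_forecast - loss)


def mean_score(var_forecast: float, losses: Sequence[float], alpha: float) -> float:
    return sum(pinball_loss(var_forecast, x, alpha) for x in losses) / len(losses)


losses = list(range(1, 100))   # artificial sample: 1, 2, ..., 99
alpha = 0.9

# Grid search over candidate forecasts; the best one is the 90% sample quantile.
best = min(range(1, 100), key=lambda v: mean_score(v, losses, alpha))
print(best)  # 90
```

No analogous score exists for ES on its own, which is why ES forecasts cannot be ranked this way.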
Acerbi-Székely tests for ES
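Acerbi & Székely (2014) proposed three backtests for ES. The simplest, $Z_1$:
$$
Z_1 = 1 - \frac{1}{N} \sum_{t=1}^{T} \frac{L_t\, I_t}{\mathrm{ES}_{\alpha,t}}
$$
where $L_t$ is the day-$t$ loss, $I_t$ flags a VaR exceedance on day $t$, and $N = \sum_t I_t$ is the number of exceedances.
Under the null, $\mathbb{E}[Z_1 \mid N > 0] = 0$. Significant negative values indicate that realised tail losses exceed forecast ES, i.e. the model under-forecasts tail risk. The distribution of $Z_1$ under the null is obtained by resampling under the forecast distribution (parametric bootstrap).
This test is practical but statistically weaker than its VaR counterparts: it uses fewer samples (only exceedance days contribute), has higher variance, and requires more history for stable results.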
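A sketch of the exceedance-day ES statistic and its parametric bootstrap, under strong simplifying assumptions that are not from the text: the forecast distribution is a zero-mean normal with constant volatility, losses are positive numbers, and VaR and ES are the closed-form normal values. The statistic is one minus the average ratio of realised loss to forecast ES on exceedance days. All function names are hypothetical.

```python
import random
from statistics import NormalDist
from typing import Optional, Sequence, Tuple


def z1_statistic(losses: Sequence[float], var_f: Sequence[float],
                 es_f: Sequence[float]) -> Optional[float]:
    """1 minus the mean of loss/ES over exceedance days; None if there are no exceedances."""
    ratios = [loss / es for loss, var, es in zip(losses, var_f, es_f) if loss > var]
    if not ratios:
        return None
    return 1.0 - sum(ratios) / len(ratios)


def normal_var_es(sigma: float, alpha: float) -> Tuple[float, float]:
    """VaR and ES of a N(0, sigma^2) loss at confidence level alpha."""
    z = NormalDist().inv_cdf(alpha)
    return sigma * z, sigma * NormalDist().pdf(z) / (1.0 - alpha)


def bootstrap_p_value(z_obs: float, n_days: int, sigma: float, alpha: float,
                      n_sims: int = 500, seed: int = 0) -> float:
    """One-sided p-value: share of simulated statistics at or below the observed one."""
    rng = random.Random(seed)
    var, es = normal_var_es(sigma, alpha)
    sims = []
    for _ in range(n_sims):
        sample = [rng.gauss(0.0, sigma) for _ in range(n_days)]
        z = z1_statistic(sample, [var] * n_days, [es] * n_days)
        if z is not None:          # paths without exceedances carry no information
            sims.append(z)
    return sum(z <= z_obs for z in sims) / len(sims)


sigma, alpha, n_days = 1.0, 0.99, 250
var, es = normal_var_es(sigma, alpha)

# Hypothetical "realised" losses with 50% more volatility than the model forecasts.
rng = random.Random(42)
realised = [rng.gauss(0.0, 1.5 * sigma) for _ in range(n_days)]
z_obs = z1_statistic(realised, [var] * n_days, [es] * n_days)
if z_obs is not None:
    print(f"Z1 = {z_obs:.3f}, bootstrap p-value = {bootstrap_p_value(z_obs, n_days, sigma, alpha):.3f}")
```

In practice the forecast distribution changes daily, so each simulated path would draw day $t$ from that day's own forecast rather than from one fixed normal.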
Key properties
- Power vs sample size. With $T = 250$, $p = 0.01$: expected exceedances = $2.5$. The Kupiec test has very low power against alternatives like $p = 0.02$ (expected 5 exceedances), because the binomial variance swamps the signal. Larger $T$ or larger $p$ improves power.
- Kupiec and Christoffersen test different properties. A model can pass one and fail the other: a steady trickle of exceedances with no clustering satisfies independence but may fail Kupiec; a model whose exceedances all fall in one crisis year can pass Kupiec over a long window but fail independence.
- Traffic-light is forgiving. The green zone goes up to 4 exceedances at 1% over 250 days. That's $\hat p = 4/250 = 1.6\%$, a 60% deviation from the nominal rate. Kupiec would not reject ($\mathrm{LR}_{\mathrm{POF}} \approx 0.77$), but economically the model is clearly off.
- ES tests are implicitly about distribution shape. Passing VaR tests does not validate ES, because the tail beyond VaR is untested. Basel FRTB (post-2016) sets capital on 97.5% ES but still backtests through VaR exceedances, so an ES-oriented backtest is a useful supplement rather than a regulatory substitute.
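The power claim can be quantified without simulation by summing binomial probabilities over the counts at which the POF test rejects. This is a sketch using the asymptotic 5% critical value (3.841) on an exact binomial distribution; the function names are hypothetical.

```python
import math


def kupiec_lr(k: int, n_days: int, p: float) -> float:
    """Kupiec POF likelihood-ratio statistic for k exceedances in n_days."""
    def loglik(q: float) -> float:
        ll = 0.0
        if k > 0:
            ll += k * math.log(q)
        if n_days - k > 0:
            ll += (n_days - k) * math.log(1.0 - q)
        return ll
    return -2.0 * (loglik(p) - loglik(k / n_days))


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def kupiec_power(p_true: float, p_null: float = 0.01, n_days: int = 250,
                 critical: float = 3.841) -> float:
    """Probability that the test rejects when the true exceedance rate is p_true."""
    return sum(binom_pmf(k, n_days, p_true)
               for k in range(n_days + 1)
               if kupiec_lr(k, n_days, p_null) > critical)


power_250 = kupiec_power(0.02, n_days=250)
power_1000 = kupiec_power(0.02, n_days=1000)
print(f"T=250: {power_250:.2f}, T=1000: {power_1000:.2f}")
```

With one year of data the test catches a model whose true exceedance rate is double the nominal one only about a quarter of the time; four years of data raise that substantially.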
Worked example — Kupiec POF by hand
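A bank runs a $(1-p)$-VaR model over $T$ days and observes $N$ exceedances.
Expected: $pT$. Empirical rate: $\hat p = N/T$.
Numerator: $(1-p)^{T-N}\, p^{N}$. Denominator: $(1-\hat p)^{T-N}\, \hat p^{N}$. The statistic is $\mathrm{LR}_{\mathrm{POF}} = -2 \ln(\text{numerator}/\text{denominator})$.
At the 5% level: reject, since the statistic exceeds the critical value $3.84$. At the 1% level: do not reject, since it falls below $6.63$.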
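The hand calculation can be replayed step by step in code. The inputs below (1% exceedance probability, 500 days, 10 exceedances) are illustrative assumptions chosen so that the outcome matches the pattern described, rejection at 5% but not at 1%; they are not taken from the example above. Logs are used so the tiny likelihood values do not need to be written out.

```python
import math

# Illustrative inputs (assumptions, not the figures of the worked example).
p, n_days, n_exceed = 0.01, 500, 10

expected = p * n_days              # expected number of exceedances
p_hat = n_exceed / n_days          # empirical exceedance rate

# Log of the numerator (likelihood under the claimed p)
log_num = (n_days - n_exceed) * math.log(1.0 - p) + n_exceed * math.log(p)
# Log of the denominator (likelihood under the empirical rate)
log_den = (n_days - n_exceed) * math.log(1.0 - p_hat) + n_exceed * math.log(p_hat)

lr = -2.0 * (log_num - log_den)
reject_5pct = lr > 3.841           # chi-square(1) critical value at 5%
reject_1pct = lr > 6.635           # chi-square(1) critical value at 1%

print(f"expected={expected:.1f}, p_hat={p_hat:.3f}, LR={lr:.2f}")
print(f"reject at 5%: {reject_5pct}, reject at 1%: {reject_1pct}")
```

The statistic comes out near 3.91, only just past the 5% threshold, which shows how borderline a doubling of the exceedance rate still is with two years of data.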
Common confusions and pitfalls
- Rolling vs fixed windows. Regulators use rolling 250-day windows; a model can look fine on a fixed calendar year and bad on a rolling window ending right after a volatility spike. Both views are legitimate.
- Clean P&L. The P&L used for backtesting must exclude fees, new trades, and trading desks' intraday P&L: it must be the change in value of yesterday's static book under today's market moves. Mixing in intraday P&L distorts exceedance counts, since intraday gains can mask exceedances and intraday losses can create spurious ones.
- Multiple testing. If a bank runs 20 VaR models across desks and tests each at the 5% level, one is expected to fail by chance. Bonferroni correction or hierarchical testing is the statistically clean approach; most banks simply don't correct.
- "No exceedances for a year" is suspicious. $N = 0$ in $T = 250$ days at 99% VaR has probability $0.99^{250} \approx 8.1\%$: not impossible, but worth investigating whether the model is systematically over-forecasting and wasting capital.
- ES backtests need bootstrapping. The null distribution of $Z_1$ depends on the shape of the forecast distribution, not just $T$ and $\alpha$. Pulling a single critical value from a textbook won't work.
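Two of the pitfalls above reduce to one-line probability calculations. The sketch assumes the 20 desk-level tests are independent, which real desks sharing market factors will not be, so the numbers are indicative only.

```python
# Multiple testing: 20 models, each tested at the 5% level (independence assumed).
n_models, level = 20, 0.05
expected_false_rejections = n_models * level
p_at_least_one = 1.0 - (1.0 - level) ** n_models
bonferroni_level = level / n_models   # per-test level that keeps the family-wise rate at 5%

# Zero exceedances in one year at 99% VaR for a correctly calibrated model.
p_zero_exceedances = 0.99 ** 250

print(f"expected false rejections: {expected_false_rejections:.1f}")
print(f"P(at least one false rejection): {p_at_least_one:.3f}")
print(f"Bonferroni per-test level: {bonferroni_level:.4f}")
print(f"P(no exceedances in 250 days): {p_zero_exceedances:.3f}")
```

Under these assumptions, seeing at least one failing desk is the likely outcome even when every model is correct, while a clean year with zero exceedances happens roughly one year in twelve.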
Where this goes next
- Coherent Risk Measures — why ES is the post-FRTB standard and why backtesting it is subtle.
- Risk Management — the full operational picture around these tests.
- Basel FRTB and traffic-light mechanics — regulatory extensions.