Backtesting Risk Models

Motivation: why this matters in quant finance

A Value-at-Risk number is a prediction. The model says "you will lose more than $X$ on at most $5\%$ of days." Like any prediction, it deserves to be tested against history. If a $1\%$-VaR model produces seven exceedances in one year of trading days (about 250), far more than the two or three expected, it is probably miscalibrated and capital is being allocated on fiction.

Backtesting answers one concrete question: given the observed sequence of exceedances (days when realised loss exceeded forecast VaR), is the model plausible? The Basel regulatory framework codifies this with the "traffic light" system (green, yellow, red), which forces banks to hold higher capital multipliers when their VaR models produce too many exceedances. The statistical machinery consists of binomial-tail tests (Kupiec), independence tests of exceedance timing (Christoffersen) and, for Expected Shortfall, which is not elicitable, a different class of simulation-based tests (Acerbi-Székely).

This note builds the three core tests, explains what each catches and what each misses, and closes with the important subtlety that ES requires different backtesting technology.

The informal idea

A VaR model's forecast is a number $\text{VaR}_t$ produced before trading day $t$. At the end of day $t$, the realised loss $L_t$ (the negative of the day's P&L) is observed. Define the exceedance indicator

I_t := \mathbf{1}\{L_t > \text{VaR}_t\}.

If the model is well-calibrated at tail probability $\alpha$ (for example $\alpha = 0.01$ for a $99\%$ VaR), then each $I_t$ is a Bernoulli random variable with parameter $\alpha$. If the exceedances are also independent across days, the total count $N = \sum_{t=1}^T I_t$ over $T$ days is $\text{Binomial}(T, \alpha)$. Three things can fail:

Wrong count. $N$ is too high (model under-forecasts risk) or too low (over-forecasts, wastes capital).
Wrong timing. Exceedances cluster in time (after a risk event, more exceedances follow) — the model ignores regime changes.
Wrong tail shape. The count is right but individual exceedances are much bigger than the model predicts.

The three tests address these in turn.
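As a minimal sketch of the bookkeeping behind $I_t$ and $N$, the snippet below turns aligned series of realised losses and VaR forecasts into a hit sequence. The function name `exceedance_indicators` and the five-day sample are hypothetical, chosen only for illustration; losses and VaR are both assumed to be quoted as positive numbers.

```python
from typing import List, Sequence


def exceedance_indicators(losses: Sequence[float],
                          var_forecasts: Sequence[float]) -> List[int]:
    """Return I_t = 1 where the realised loss exceeded the VaR forecast, else 0."""
    if len(losses) != len(var_forecasts):
        raise ValueError("losses and var_forecasts must have the same length")
    return [1 if loss > var else 0 for loss, var in zip(losses, var_forecasts)]


# Hypothetical five-day sample (a negative loss is a gain).
losses = [0.4, 1.7, -0.2, 2.5, 0.9]
var_forecasts = [1.5, 1.5, 1.6, 2.0, 2.1]

hits = exceedance_indicators(losses, var_forecasts)
n_exceedances = sum(hits)             # N in the text
hit_rate = n_exceedances / len(hits)  # empirical exceedance frequency
```

The hit sequence is the only input the count and timing tests need; the tail-shape test additionally needs the sizes of the losses on exceedance days.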

Formal definitions

Kupiec's Proportion of Failures (POF) test

Under the null hypothesis that the model is correctly calibrated, $N \sim \text{Binomial}(T, \alpha)$. Kupiec's test uses the likelihood ratio

LR_{\text{POF}} = -2\ln\left[\frac{\alpha^N(1-\alpha)^{T-N}}{\hat{\pi}^N(1-\hat{\pi})^{T-N}}\right], \qquad \hat{\pi} = N/T.

Under the null, $LR_{\text{POF}} \sim \chi^2_1$ asymptotically. Reject at significance level $\beta$ if $LR_{\text{POF}} > \chi^2_{1, 1-\beta}$ (e.g., $3.84$ for $\beta = 0.05$).

Intuition: the test compares the likelihood of the observed data under the claimed $\alpha$ with the likelihood under the empirical MLE $\hat{\pi}$. If they are far apart, reject.
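The likelihood ratio above can be sketched in a few lines of standard-library Python. The names `kupiec_pof` and `_xlogy` are hypothetical; the helper applies the convention $0 \cdot \ln 0 = 0$ so that $N = 0$ and $N = T$ do not break the logarithms. The $p$-value uses the identity that a $\chi^2_1$ tail probability equals $\operatorname{erfc}(\sqrt{x/2})$, and it is only the asymptotic approximation, not an exact small-sample value.

```python
import math
from typing import Tuple


def _xlogy(x: float, y: float) -> float:
    """x * ln(y) with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_pof(n_exceed: int, n_days: int, alpha: float) -> Tuple[float, float]:
    """Return (LR_POF, asymptotic chi-square(1) p-value) for N exceedances in T days."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0 <= n_exceed <= n_days or n_days == 0:
        raise ValueError("need 0 <= n_exceed <= n_days and n_days > 0")
    pi_hat = n_exceed / n_days
    # Log-likelihood under the claimed rate alpha and under the MLE pi_hat.
    log_null = _xlogy(n_exceed, alpha) + _xlogy(n_days - n_exceed, 1.0 - alpha)
    log_alt = _xlogy(n_exceed, pi_hat) + _xlogy(n_days - n_exceed, 1.0 - pi_hat)
    lr = -2.0 * (log_null - log_alt)
    # P(chi2_1 > lr) = erfc(sqrt(lr / 2)); max() guards against tiny negative rounding.
    p_value = math.erfc(math.sqrt(max(lr, 0.0) / 2.0))
    return lr, p_value


lr, p_value = kupiec_pof(n_exceed=7, n_days=250, alpha=0.01)
```

With 7 exceedances in 250 days at $\alpha = 0.01$, the statistic comes out near $5.5$, above the $3.84$ cut-off but below $6.63$.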

Christoffersen's independence test

Even if the marginal hit rate is right, exceedances might cluster: one exceedance makes the next more likely. The test conditions on the previous day's indicator and compares the transition probabilities $\pi_{01} = \mathbb{P}(I_t = 1 \mid I_{t-1} = 0)$ and $\pi_{11} = \mathbb{P}(I_t = 1 \mid I_{t-1} = 1)$. Under independence, $\pi_{01} = \pi_{11}$.

Define $n_{ij}$ as the number of days $t$ with $I_{t-1} = i$ and $I_t = j$. The likelihood ratio is

LR_{\text{ind}} = -2\ln\left[\frac{(1-\hat{\pi})^{n_{00}+n_{10}}\hat{\pi}^{n_{01}+n_{11}}}{(1-\hat{\pi}_{01})^{n_{00}}\hat{\pi}_{01}^{n_{01}}(1-\hat{\pi}_{11})^{n_{10}}\hat{\pi}_{11}^{n_{11}}}\right],

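where $\hat{\pi}_{01} = n_{01}/(n_{00}+n_{01})$, $\hat{\pi}_{11} = n_{11}/(n_{10}+n_{11})$ and $\hat{\pi} = (n_{01}+n_{11})/(n_{00}+n_{01}+n_{10}+n_{11})$ are the empirical estimates. Under independence, $LR_{\text{ind}}$ is asymptotically $\chi^2_1$.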

The conditional coverage test combines POF and independence:

LR_{\text{CC}} = LR_{\text{POF}} + LR_{\text{ind}} \sim \chi^2_2.
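A minimal sketch of the independence test, assuming the hit sequence is a list of 0/1 integers. The function names are hypothetical. Transition counts $n_{ij}$ are tallied from consecutive pairs, empty cells use the convention $0 \cdot \ln 0 = 0$, and the $p$-values are the asymptotic ones ($\chi^2_1$ via `erfc`, $\chi^2_2$ via $e^{-x/2}$).

```python
import math
from typing import Sequence, Tuple


def _xlogy(x: float, y: float) -> float:
    """x * ln(y) with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def christoffersen_independence(hits: Sequence[int]) -> Tuple[float, float]:
    """Return (LR_ind, asymptotic chi-square(1) p-value) for a 0/1 hit sequence."""
    n = [[0, 0], [0, 0]]  # n[i][j] counts transitions I_{t-1} = i -> I_t = j
    for prev, curr in zip(hits[:-1], hits[1:]):
        n[prev][curr] += 1
    n00, n01, n10, n11 = n[0][0], n[0][1], n[1][0], n[1][1]
    total = n00 + n01 + n10 + n11
    if total == 0:
        raise ValueError("need at least two observations")
    pi = (n01 + n11) / total
    pi01 = n01 / (n00 + n01) if (n00 + n01) else 0.0
    pi11 = n11 / (n10 + n11) if (n10 + n11) else 0.0
    # Null: one hit probability for every day. Alternative: first-order Markov chain.
    log_null = _xlogy(n00 + n10, 1.0 - pi) + _xlogy(n01 + n11, pi)
    log_alt = (_xlogy(n00, 1.0 - pi01) + _xlogy(n01, pi01)
               + _xlogy(n10, 1.0 - pi11) + _xlogy(n11, pi11))
    lr = -2.0 * (log_null - log_alt)
    return lr, math.erfc(math.sqrt(max(lr, 0.0) / 2.0))


def conditional_coverage_pvalue(lr_pof: float, lr_ind: float) -> float:
    """Asymptotic chi-square(2) p-value of LR_CC = LR_POF + LR_ind."""
    return math.exp(-(lr_pof + lr_ind) / 2.0)


# Hypothetical sequences: five hits in a row versus five evenly spaced hits.
clustered = [0] * 50 + [1] * 5 + [0] * 45
spread = [1 if t % 20 == 0 else 0 for t in range(100)]
lr_clustered, p_clustered = christoffersen_independence(clustered)
lr_spread, p_spread = christoffersen_independence(spread)
```

Both sequences have the same number of hits, so a count-based test cannot tell them apart; only the transition counts differ.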

Basel traffic light

Over a window of $T = 250$ trading days (one year), the regulator counts exceedances of the $1\%$ VaR:

Zone	Exceedances	Capital multiplier (relative to the baseline of $3.0$)
Green	0-4	$\times 1.00$
Yellow	5-9	$\times 1.13$ to $\times 1.28$ (increases with count)
Red	10+	$\times 1.33$ (plus regulatory review)

At $1\%$ VaR over 250 days, expected exceedances = $2.5$. Under the Binomial model, $\mathbb{P}(N \ge 10) \approx 2.5 \times 10^{-4}$, so red is genuinely severe.
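The zone boundaries and the binomial tail can be checked directly. This is a small sketch under the i.i.d. Binomial$(250, 0.01)$ assumption; `binomial_tail` and `basel_zone` are hypothetical helper names, and the zone function encodes only the exceedance ranges from the table, not the capital multipliers.

```python
from math import comb


def binomial_tail(n_min: int, n_days: int = 250, alpha: float = 0.01) -> float:
    """P(N >= n_min) for N ~ Binomial(n_days, alpha), via the complement sum."""
    below = sum(comb(n_days, k) * alpha**k * (1.0 - alpha) ** (n_days - k)
                for k in range(n_min))
    return 1.0 - below


def basel_zone(n_exceed: int) -> str:
    """Map a 250-day exceedance count to the traffic-light zone."""
    if n_exceed <= 4:
        return "green"
    if n_exceed <= 9:
        return "yellow"
    return "red"


p_yellow_or_worse = binomial_tail(5)   # chance a correct model leaves green
p_red = binomial_tail(10)              # chance a correct model lands in red
zone = basel_zone(7)
```

Comparing `p_yellow_or_worse` with `p_red` shows how differently the two boundaries treat a correctly calibrated model.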

ES is not elicitable

A statistic $\hat{s}$ of a distribution is elicitable if there exists a strictly consistent scoring function $S(\hat{s}, x)$ such that the expected score $\mathbb{E}[S(\hat{s}, X)]$ is uniquely minimised at $\hat{s} = s(X)$ (the true statistic).

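VaR (a quantile) is elicitable. The scoring function is the pinball loss $S(q, x) = (\mathbf{1}\{x < q\} - p)(q - x)$, where $x$ is the realised loss and $p$ is the quantile level ($p = 1 - \alpha$ when $\alpha$ is the tail probability of losses). The score is non-negative and its expectation is minimised at the true quantile.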
ES is not elicitable (Gneiting 2011) — no such scoring function exists for ES alone.

Why this matters. Elicitability is what allows "lower score $\Rightarrow$ better forecast" model comparison. ES comparisons require either (i) joint elicitability of (VaR, ES), which Fissler & Ziegel (2016) proved holds, or (ii) indirect tests.
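To make "lower score, better forecast" concrete for VaR, the sketch below scores candidate quantile forecasts with the pinball loss written as $(\mathbf{1}\{x < q\} - p)(q - x)$, where $p$ is the quantile level. The sample (the integers 1 to 99 treated as losses) and the function names are hypothetical, picked so the minimiser can be checked by hand.

```python
from typing import Sequence


def pinball_loss(q: float, x: float, level: float) -> float:
    """Pinball score of forecast q for the `level`-quantile, given outcome x."""
    indicator = 1.0 if x < q else 0.0
    return (indicator - level) * (q - x)


def mean_score(q: float, sample: Sequence[float], level: float) -> float:
    """Average pinball score of a constant forecast q over the sample."""
    return sum(pinball_loss(q, x, level) for x in sample) / len(sample)


# Hypothetical loss sample. With tail probability alpha = 0.05 on losses,
# VaR is the 0.95-quantile of the loss distribution.
sample = list(range(1, 100))
alpha = 0.05
candidates = range(1, 100)
best_q = min(candidates, key=lambda q: mean_score(q, sample, 1.0 - alpha))
```

The candidate with the lowest average score is the empirical $0.95$-quantile of the sample, which is the sense in which the score "elicits" VaR. No analogous single-argument score exists for ES.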

Acerbi-Székely tests for ES

Acerbi & Székely (2014) proposed three backtests for ES. The simplest is $Z_2$, written here in terms of losses:

Z_2 := 1 - \frac{1}{T\alpha}\sum_{t=1}^T \frac{L_t\,\mathbf{1}\{L_t > \text{VaR}_t\}}{\text{ES}_t}.

Under the null, $\mathbb{E}[Z_2] = 0$. Significantly negative values indicate that realised tail losses exceed forecast ES, i.e. the model underestimates tail risk. The distribution of $Z_2$ under the null is obtained by simulating losses from the forecast distribution (parametric bootstrap).

This test is practical but statistically weaker than its VaR counterparts: only exceedance days contribute to the sum, so it has higher variance and needs more history for stable results.
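A minimal sketch of $Z_2$ with a simulated null distribution, using the loss-based convention $Z_2 = 1 - \frac{1}{T\alpha}\sum_t L_t \mathbf{1}\{L_t > \text{VaR}_t\}/\text{ES}_t$, so that negative values flag underestimated ES. The null simulation assumes, purely for illustration, that the forecast distribution is i.i.d. $N(0, \sigma^2)$ for losses; a real backtest would simulate from each day's actual forecast distribution. All function names are hypothetical.

```python
import random
from statistics import NormalDist
from typing import Sequence


def z2_statistic(losses: Sequence[float], var_forecasts: Sequence[float],
                 es_forecasts: Sequence[float], alpha: float) -> float:
    """Z_2 = 1 - (1 / (T * alpha)) * sum of L_t / ES_t over exceedance days."""
    tail_sum = sum(loss / es
                   for loss, var, es in zip(losses, var_forecasts, es_forecasts)
                   if loss > var)
    return 1.0 - tail_sum / (len(losses) * alpha)


def z2_pvalue_normal(observed_z2: float, sigma: float, alpha: float,
                     n_days: int, n_sims: int = 1000, seed: int = 0) -> float:
    """One-sided Monte Carlo p-value, assuming N(0, sigma^2) forecast losses."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha)
    var = sigma * z                          # normal VaR at tail probability alpha
    es = sigma * std_normal.pdf(z) / alpha   # normal ES: E[L | L > VaR]
    var_path, es_path = [var] * n_days, [es] * n_days
    rng = random.Random(seed)
    at_or_below = 0
    for _ in range(n_sims):
        simulated = [rng.gauss(0.0, sigma) for _ in range(n_days)]
        if z2_statistic(simulated, var_path, es_path, alpha) <= observed_z2:
            at_or_below += 1
    # Fraction of null draws at least as negative as the observed value.
    return at_or_below / n_sims


# Hand-checkable case: one exceedance of size 4 against ES = 2, T = 4, alpha = 0.25.
z2_toy = z2_statistic([0.0, 0.0, 0.0, 4.0], [1.0] * 4, [2.0] * 4, alpha=0.25)
p_toy = z2_pvalue_normal(-1.0, sigma=1.0, alpha=0.025, n_days=250, n_sims=200)
```

The $p$-value is one-sided because only large tail losses (negative $Z_2$) count as evidence against the model in this convention.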

Key properties

Power vs sample size. With $T = 250$ and $\alpha = 0.01$, expected exceedances = $2.5$. The Kupiec test has low power against alternatives like $\alpha^* = 0.02$ (expected 5 exceedances), because the binomial variance swamps the signal. Larger $T$ or larger $\alpha$ improves power.
Kupiec and Christoffersen test different failure modes. A model can pass one and fail the other: a steady stream of evenly spaced exceedances satisfies independence but may fail Kupiec, while a model whose exceedances all fall in one crisis year can pass Kupiec over a long window but fail independence.
Traffic-light is forgiving. The green zone goes up to 4 exceedances at $1\%$ over 250 days. That's $\hat{\pi} = 1.6\%$, a $60\%$ deviation from the nominal rate. Kupiec would not reject ($LR_{\text{POF}} \approx 0.77$), yet if that rate persisted the model would be materially understating risk.
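ES tests are implicitly about distribution shape. Passing VaR tests does not validate ES, because the tail beyond VaR is untested. Under Basel FRTB (post-2016), capital is computed from ES at $97.5\%$ while the regulatory backtest still counts VaR exceedances, so validating the ES tail itself calls for a supplementary test such as $Z_2$.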
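The power claim in the first property can be quantified exactly, because $N$ takes only $T + 1$ values. The sketch below enumerates them, flags those where $LR_{\text{POF}}$ exceeds the $5\%$ critical value, and sums their probabilities under a chosen true rate. It assumes i.i.d. exceedances; `kupiec_lr`, `binom_pmf` and `kupiec_rejection_prob` are hypothetical names.

```python
import math
from math import comb


def kupiec_lr(n: int, t: int, alpha: float) -> float:
    """LR_POF for n exceedances in t days, using 0 * ln(0) = 0."""
    def xlogy(x: float, y: float) -> float:
        return 0.0 if x == 0 else x * math.log(y)

    pi_hat = n / t
    log_null = xlogy(n, alpha) + xlogy(t - n, 1.0 - alpha)
    log_alt = xlogy(n, pi_hat) + xlogy(t - n, 1.0 - pi_hat)
    return -2.0 * (log_null - log_alt)


def binom_pmf(k: int, t: int, p: float) -> float:
    """P(N = k) for N ~ Binomial(t, p); adequate for t in the hundreds."""
    return comb(t, k) * p**k * (1.0 - p) ** (t - k)


def kupiec_rejection_prob(true_rate: float, t: int = 250, alpha: float = 0.01,
                          critical: float = 3.84) -> float:
    """Exact probability that LR_POF > critical when hits are i.i.d. with true_rate."""
    return sum(binom_pmf(n, t, true_rate)
               for n in range(t + 1)
               if kupiec_lr(n, t, alpha) > critical)


power_at_2pct = kupiec_rejection_prob(true_rate=0.02)  # model is wrong by a factor of 2
size_at_1pct = kupiec_rejection_prob(true_rate=0.01)   # model is actually correct
```

Under these assumptions the test rejects a model whose true exceedance rate is double the nominal one only about a quarter of the time. The exact size also differs from the nominal $5\%$, because the $\chi^2_1$ approximation is rough when only 2.5 exceedances are expected.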

Worked example — Kupiec POF by hand

A bank runs a $1\%$-VaR model over $T = 250$ days and observes $N = 7$ exceedances.

Expected: $\mathbb{E}[N] = 2.5$. Empirical rate: $\hat{\pi} = 7/250 = 0.028$.

LR_{\text{POF}} = -2\ln\left[\frac{(0.01)^7(0.99)^{243}}{(0.028)^7(0.972)^{243}}\right].

Log of the numerator: $7\ln(0.01) + 243\ln(0.99) = 7(-4.6052) + 243(-0.01005) = -32.24 - 2.44 = -34.68$. Log of the denominator: $7\ln(0.028) + 243\ln(0.972) = 7(-3.5756) + 243(-0.02840) = -25.03 - 6.90 = -31.93$.

$LR_{\text{POF}} = -2(-34.68 - (-31.93)) = -2(-2.75) = 5.50.$

At $\chi^2_{1, 0.95} = 3.84$: reject at $5\%$. At $\chi^2_{1, 0.99} = 6.63$: do not reject at $1\%$.
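The asymptotic $p$-value can be worked out by hand as well. A $\chi^2_1$ variable is the square of a standard normal, so its upper tail can be written with the standard normal CDF $\Phi$:

$p = \mathbb{P}(\chi^2_1 > 5.50) = 2\left(1 - \Phi(\sqrt{5.50})\right) = 2\left(1 - \Phi(2.345)\right) \approx 2 \times 0.0095 = 0.019.$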

Verdict. Marginally miscalibrated: $p$-value around $2\%$. The model likely underestimates risk, but the evidence is not overwhelming. Consistent with the Basel yellow zone ($5 \le N \le 9$).

Common confusions and pitfalls

Rolling vs fixed windows. Regulators use rolling 250-day windows; a model can look fine on a fixed calendar year and bad on a rolling window ending right after a volatility spike. Both views are legitimate.
Clean P&L. The P&L used for backtesting must exclude fees, new trades, and trading desks' intraday P&L: it must be the change in value of yesterday's static book under today's market moves. Mixing in intraday P&L distorts exceedance counts.
Multiple testing. If a bank runs 20 VaR models across desks and tests each at $5\%$, one is expected to fail by chance. Bonferroni correction or hierarchical testing is the statistically clean approach; most banks simply don't correct.
"No exceedances for a year" is suspicious. $N = 0$ in $250$ days at $1\%$ VaR has probability $0.99^{250} \approx 8\%$ — not impossible, but worth investigating whether the model is systematically over-forecasting and wasting capital.
ES backtests need bootstrapping. The null distribution of $Z_2$ depends on the shape of the forecast distribution, not just $\alpha$ and $\text{ES}$. Pulling a single critical value from a textbook won't work.
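The rolling-versus-fixed-window pitfall is easy to reproduce. In the sketch below, a hypothetical two-year hit sequence has six exceedances straddling the year boundary; `rolling_exceedance_counts` is an illustrative helper that keeps a running sum as the 250-day window slides forward one day at a time.

```python
from typing import List, Sequence


def rolling_exceedance_counts(hits: Sequence[int], window: int = 250) -> List[int]:
    """Exceedance count in each trailing `window`-day span, one per window end date."""
    if window <= 0 or window > len(hits):
        raise ValueError("window must be between 1 and len(hits)")
    running = sum(hits[:window])
    counts = [running]
    for t in range(window, len(hits)):
        # Slide by one day: add the newest hit, drop the one leaving the window.
        running += hits[t] - hits[t - window]
        counts.append(running)
    return counts


# Hypothetical two years of hits: three exceedances in the last days of
# year 1 and three in the first days of year 2.
hits = [0] * 500
for day in range(247, 253):
    hits[day] = 1

year_1 = sum(hits[:250])       # fixed calendar window
year_2 = sum(hits[250:500])    # fixed calendar window
worst_rolling = max(rolling_exceedance_counts(hits, window=250))
```

Each calendar year on its own sits in the green zone, while the worst rolling window lands in yellow, which is why both views are worth monitoring.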

Where this goes next

Coherent Risk Measures — why ES is the post-FRTB standard and why backtesting it is subtle.
Risk Management — the full operational picture around these tests.
Basel FRTB and traffic-light mechanics — regulatory extensions.