Backtesting Risk Models

Motivation: why this matters in quant finance

A Value-at-Risk number is a prediction. The model says "you will lose more than $X$ on at most $5\%$ of days." Like any prediction, it deserves to be tested against history. If a $1\%$-VaR model produces seven exceedances in one year of trading days (about 250), far more than the two or three expected, it is likely miscalibrated and capital is being allocated on fiction.

Backtesting answers one concrete question: given the observed sequence of exceedances (days when realised loss exceeded forecast VaR), is the model plausible? The Basel regulatory framework codifies this with the "traffic light" system (green, yellow, red), which forces banks to apply higher capital multipliers when their VaR models produce too many exceedances. The statistical machinery consists of binomial-tail tests (Kupiec), independence tests of exceedance timing (Christoffersen) and, for Expected Shortfall, which is not elicitable on its own, a different class of tests (Acerbi-Székely).

This note builds the three core tests, explains what each catches and what each misses, and closes with the important subtlety that ES requires different backtesting technology.

The informal idea

A VaR model's forecast is a number $\text{VaR}_t$ produced before trading day $t$. At the end of day $t$, the realised loss $L_t$ (the negative of the day's P&L) is observed. Define the exceedance indicator
$$I_t := \mathbf{1}\{L_t > \text{VaR}_t\}.$$

If the model is well calibrated at tail probability $\alpha$ (for example, $\alpha = 0.01$ for a $99\%$ VaR), then each $I_t$ is a Bernoulli random variable with parameter $\alpha$. If the $I_t$ are also independent, the total count $N = \sum_{t=1}^{T} I_t$ over $T$ days follows a $\text{Binomial}(T, \alpha)$ distribution. Three things can fail:

  • Wrong count. $N$ is too high (the model under-forecasts risk) or too low (it over-forecasts risk and wastes capital).
  • Wrong timing. Exceedances cluster in time (after a risk event, more exceedances follow) — the model ignores regime changes.
  • Wrong tail shape. The count is right but individual exceedances are much bigger than the model predicts.

The three tests address these in turn.
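
As a minimal sketch of the bookkeeping behind all three tests, the exceedance indicators and their count can be computed as follows. The function name and the five-day sample are hypothetical, and losses are assumed to be recorded as positive numbers on the same scale as the VaR forecasts.

```python
def exceedance_indicators(losses, var_forecasts):
    """Return I_t = 1 if the realised loss exceeded the VaR forecast, else 0."""
    if len(losses) != len(var_forecasts):
        raise ValueError("losses and var_forecasts must have the same length")
    return [1 if loss > var else 0 for loss, var in zip(losses, var_forecasts)]


# Hypothetical five-day sample (illustration only, not real data).
losses = [0.4, 2.9, -1.1, 3.5, 0.2]          # negative loss = profit
var_forecasts = [2.5, 2.5, 2.6, 2.6, 2.7]    # forecast made before each day

hits = exceedance_indicators(losses, var_forecasts)
n_exceedances = sum(hits)                    # the count N
hit_rate = n_exceedances / len(hits)         # the empirical rate N / T

print(hits, n_exceedances, hit_rate)
```

The list `hits` is the only input the Kupiec and Christoffersen tests need; the ES test additionally needs the loss sizes on exceedance days.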

Formal definitions

Kupiec's Proportion of Failures (POF) test

Under the null hypothesis that the model is correctly calibrated, $N \sim \text{Binomial}(T, \alpha)$. Kupiec's test uses the likelihood ratio

$$LR_{\text{POF}} = -2\ln\left[\frac{\alpha^N(1-\alpha)^{T-N}}{\hat{\pi}^N(1-\hat{\pi})^{T-N}}\right], \qquad \hat{\pi} = N/T.$$

Under the null, $LR_{\text{POF}} \sim \chi^2_1$ asymptotically. Reject at significance level $\beta$ if $LR_{\text{POF}} > \chi^2_{1, 1-\beta}$ (e.g., $3.84$ for $\beta = 0.05$).

Intuition: the test compares the likelihood of the observed data under the claimed $\alpha$ with its likelihood under the empirical MLE $\hat{\pi}$. If the two are far apart, reject.
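
The likelihood ratio can be evaluated directly in log space. The sketch below is one possible implementation using only the Python standard library; the function names are illustrative, and the $p$-value relies on the identity $\mathbb{P}(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$ together with the asymptotic $\chi^2_1$ approximation.

```python
import math


def _xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0 (needed when N = 0 or N = T)."""
    return 0.0 if x == 0 else x * math.log(y)


def binom_loglik(n, t, p):
    """Binomial log-likelihood of n exceedances in t days (constant term dropped)."""
    return _xlogy(n, p) + _xlogy(t - n, 1.0 - p)


def kupiec_pof(n, t, alpha):
    """Return (LR_POF, asymptotic p-value) for n exceedances in t days."""
    pi_hat = n / t
    lr = -2.0 * (binom_loglik(n, t, alpha) - binom_loglik(n, t, pi_hat))
    lr = max(lr, 0.0)                         # guard against tiny negative rounding
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square survival function, 1 dof
    return lr, p_value


lr, p_value = kupiec_pof(n=7, t=250, alpha=0.01)
print(f"LR_POF = {lr:.2f}, p-value = {p_value:.3f}")
```

Working with log-likelihoods avoids underflow in terms such as $\alpha^N$ for large $N$, and the `_xlogy` helper keeps the statistic finite when no exceedances are observed.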

Christoffersen's independence test

Even if the marginal hit rate is right, exceedances might cluster: one exceedance makes the next more likely. The test conditions on the previous day's indicator and compares the transition probabilities $\pi_{01} = \mathbb{P}(I_t = 1 \mid I_{t-1} = 0)$ and $\pi_{11} = \mathbb{P}(I_t = 1 \mid I_{t-1} = 1)$. Under independence, $\pi_{01} = \pi_{11}$.

Define $n_{ij}$ as the number of days $t$ with $I_{t-1} = i$ and $I_t = j$. The likelihood ratio is

$$LR_{\text{ind}} = -2\ln\left[\frac{(1-\hat{\pi})^{n_{00}+n_{10}}\,\hat{\pi}^{n_{01}+n_{11}}}{(1-\hat{\pi}_{01})^{n_{00}}\,\hat{\pi}_{01}^{n_{01}}\,(1-\hat{\pi}_{11})^{n_{10}}\,\hat{\pi}_{11}^{n_{11}}}\right],$$

where $\hat{\pi} = (n_{01}+n_{11})/(n_{00}+n_{01}+n_{10}+n_{11})$, $\hat{\pi}_{01} = n_{01}/(n_{00}+n_{01})$ and $\hat{\pi}_{11} = n_{11}/(n_{10}+n_{11})$ are the empirical estimates. Under the null of independence, $LR_{\text{ind}}$ is asymptotically $\chi^2_1$.

The conditional coverage test combines POF and independence:
$$LR_{\text{CC}} = LR_{\text{POF}} + LR_{\text{ind}} \sim \chi^2_2.$$
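
A minimal sketch of the independence and conditional coverage statistics, assuming the hit sequence is a list of 0/1 integers. Function names and the two 250-day sequences are hypothetical. The sketch adds a POF statistic computed on all $T$ days to an independence statistic computed on the $T-1$ transitions, which is a common simplification rather than the only possible convention.

```python
import math


def _xlogy(x, y):
    """x * ln(y), with 0 * ln(0) taken as 0."""
    return 0.0 if x == 0 else x * math.log(y)


def transition_counts(hits):
    """n[i][j] = number of days t with I_{t-1} = i and I_t = j."""
    n = [[0, 0], [0, 0]]
    for prev, curr in zip(hits[:-1], hits[1:]):
        n[prev][curr] += 1
    return n


def lr_independence(hits):
    (n00, n01), (n10, n11) = transition_counts(hits)
    pi = (n01 + n11) / (n00 + n01 + n10 + n11)
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    loglik_null = _xlogy(n00 + n10, 1 - pi) + _xlogy(n01 + n11, pi)
    loglik_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
                  + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    return max(0.0, -2.0 * (loglik_null - loglik_alt))


def lr_pof(hits, alpha):
    n, t = sum(hits), len(hits)
    pi_hat = n / t
    loglik_null = _xlogy(n, alpha) + _xlogy(t - n, 1 - alpha)
    loglik_alt = _xlogy(n, pi_hat) + _xlogy(t - n, 1 - pi_hat)
    return max(0.0, -2.0 * (loglik_null - loglik_alt))


def conditional_coverage(hits, alpha):
    lr_cc = lr_pof(hits, alpha) + lr_independence(hits)
    p_value = math.exp(-lr_cc / 2.0)   # chi-square survival function, 2 dof
    return lr_cc, p_value


# Two hypothetical sequences with the same count (5 hits in 250 days).
clustered = [0] * 240 + [1] * 5 + [0] * 5
spread = [1 if t % 50 == 25 else 0 for t in range(250)]

print(lr_independence(clustered), lr_independence(spread))
print(conditional_coverage(clustered, alpha=0.01))
```

Both sequences have identical $LR_{\text{POF}}$, so any difference in the conditional coverage statistic comes entirely from the timing of the exceedances.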

Basel traffic light

Over a window of $T = 250$ trading days (one year), the regulator counts exceedances of the $1\%$ VaR:

| Zone | Exceedances | Capital multiplier (relative to the baseline of $3.0$) |
| --- | --- | --- |
| Green | 0-4 | $\times 1.00$ |
| Yellow | 5-9 | $\times 1.13$ to $\times 1.28$ (increases with count) |
| Red | 10+ | $\times 1.33$ (regulatory review) |

At $1\%$ VaR over 250 days, the expected number of exceedances is $2.5$. Under the Binomial model, $\mathbb{P}(N \ge 10) \approx 2.5 \times 10^{-4}$, so red is genuinely severe.
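
The zone probabilities under a correctly calibrated model follow from the binomial distribution. The sketch below computes them exactly; the function names are illustrative, and independence of exceedances is assumed.

```python
from math import comb


def binom_pmf(k, t, p):
    """P(N = k) for N ~ Binomial(t, p)."""
    return comb(t, k) * p**k * (1 - p)**(t - k)


def zone_probabilities(t=250, alpha=0.01):
    """Probability that a correct model lands in each traffic-light zone."""
    green = sum(binom_pmf(k, t, alpha) for k in range(0, 5))    # 0-4 exceedances
    yellow = sum(binom_pmf(k, t, alpha) for k in range(5, 10))  # 5-9 exceedances
    red = 1.0 - green - yellow                                  # 10 or more
    return green, yellow, red


green, yellow, red = zone_probabilities()
print(f"green={green:.4f}, yellow={yellow:.4f}, red={red:.6f}")
```

A correct model therefore lands in the yellow zone in roughly one year out of ten, which is why yellow triggers a graduated penalty rather than outright rejection of the model.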

ES is not elicitable

A statistic $s$ of a distribution is elicitable if there exists a strictly consistent scoring function $S(\hat{s}, x)$ such that the expected score $\mathbb{E}[S(\hat{s}, X)]$ is uniquely minimised when the forecast $\hat{s}$ equals the true value $s(X)$.
  • VaR (a quantile) is elicitable. For a quantile at level $\alpha$, the scoring function is the pinball loss $S(q, x) = (\alpha - \mathbf{1}\{x < q\})(x - q)$, which is non-negative. (With losses and tail probability $\alpha$ as above, VaR is the $(1-\alpha)$-quantile of the loss.)
  • ES is not elicitable (Gneiting 2011): no such scoring function exists for ES alone.
Why this matters. Elicitability is what allows "lower score $\Rightarrow$ better forecast" model comparison. ES comparisons require either (i) joint elicitability of (VaR, ES), which Fissler & Ziegel (2016) proved holds, or (ii) indirect tests.
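
To make elicitability of a quantile concrete, the sketch below checks numerically that the average pinball loss over a sample is minimised at the sample quantile. The sample, the quantile level of 0.9 and the candidate grid are hypothetical choices for illustration; the loss is written in its non-negative form, with `level` denoting the quantile level.

```python
def pinball_loss(q, x, level):
    """Pinball loss (level - 1{x < q}) * (x - q): non-negative, zero when x == q."""
    indicator = 1.0 if x < q else 0.0
    return (level - indicator) * (x - q)


def mean_score(q, sample, level):
    """Average pinball loss of the candidate quantile q over the sample."""
    return sum(pinball_loss(q, x, level) for x in sample) / len(sample)


sample = list(range(1, 101))                  # hypothetical losses 1, 2, ..., 100
level = 0.9
candidates = [k / 2 for k in range(2, 201)]   # grid 1.0, 1.5, ..., 100.0

best_q = min(candidates, key=lambda q: mean_score(q, sample, level))
print(best_q)   # lies between the 90th and 91st ordered observations
```

The same comparison is what "lower score means better forecast" amounts to in practice: two competing VaR models can be ranked by their average pinball loss on realised data. No analogous single-argument score exists for ES on its own.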

Acerbi-Székely tests for ES

Acerbi & Székely (2014) proposed three backtests for ES. The simplest, $Z_2$, written in terms of losses, is

$$Z_2 := 1 - \frac{1}{T\alpha}\sum_{t=1}^T \frac{L_t\,\mathbf{1}\{L_t > \text{VaR}_t\}}{\text{ES}_t}.$$

Under the null, $\mathbb{E}[Z_2] = 0$. Significantly negative values indicate that realised tail losses exceed forecast ES, i.e. the model under-forecasts risk. The distribution of $Z_2$ under the null is obtained by simulating from the forecast distribution (parametric bootstrap).

This test is practical but statistically weaker than its VaR counterparts: it draws on fewer observations (only exceedance days contribute to the sum), has higher variance, and needs more history for stable results.
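
A minimal sketch of the $Z_2$ statistic with a parametric bootstrap, using the sign convention in which negative values signal under-forecast ES. Everything specific here is an assumption made for illustration: the forecast distribution is taken to be zero-mean normal with a known daily volatility, the tail probability is set to $0.025$, and the "realised" losses are simulated with a volatility 50% above the forecast. Function names are hypothetical.

```python
import random
from statistics import NormalDist


def z2_statistic(losses, var_forecasts, es_forecasts, alpha):
    """Z_2 = 1 - (1 / (T * alpha)) * sum of L_t / ES_t over exceedance days."""
    t = len(losses)
    total = sum(loss / es
                for loss, var, es in zip(losses, var_forecasts, es_forecasts)
                if loss > var)
    return 1.0 - total / (t * alpha)


def normal_var_es(sigma, alpha):
    """VaR and ES of a zero-mean normal loss at tail probability alpha."""
    std = NormalDist()
    z = std.inv_cdf(1.0 - alpha)
    return sigma * z, sigma * std.pdf(z) / alpha


def bootstrap_p_value(z2_obs, sigmas, alpha, n_sims=1000, seed=0):
    """Share of simulated Z_2 values (under the forecast model) at or below z2_obs."""
    rng = random.Random(seed)
    forecasts = [normal_var_es(s, alpha) for s in sigmas]
    var_f = [f[0] for f in forecasts]
    es_f = [f[1] for f in forecasts]
    count = 0
    for _ in range(n_sims):
        simulated = [rng.gauss(0.0, s) for s in sigmas]
        if z2_statistic(simulated, var_f, es_f, alpha) <= z2_obs:
            count += 1
    return count / n_sims


alpha, T = 0.025, 250
forecast_sigmas = [1.0] * T                       # assumed forecast volatility
rng = random.Random(42)
realised = [rng.gauss(0.0, 1.5) for _ in range(T)]  # hypothetical under-forecast

var_1, es_1 = normal_var_es(1.0, alpha)
z2_obs = z2_statistic(realised, [var_1] * T, [es_1] * T, alpha)
p_value = bootstrap_p_value(z2_obs, forecast_sigmas, alpha)
print(f"Z2 = {z2_obs:.2f}, bootstrap p-value = {p_value:.3f}")
```

In a real backtest the simulation step would draw from whatever distribution the risk model actually forecast on each day, which is why the critical value cannot be tabulated once and reused.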

Key properties

  • Power vs sample size. With $T = 250$ and $\alpha = 0.01$, the expected number of exceedances is $2.5$. The Kupiec test has very low power against alternatives like $\alpha^* = 0.02$ (expected 5 exceedances): the binomial variance swamps the signal. Larger $T$ or larger $\alpha$ improves power.
  • Kupiec and Christoffersen test different properties. A model can pass one and fail the other: a steady trickle of too many exceedances with no clustering satisfies independence but may fail Kupiec; a model whose exceedances are concentrated in one crisis year can pass Kupiec over a long window but fail independence.
  • Traffic-light is forgiving. The green zone goes up to 4 exceedances at $1\%$ over 250 days. That is $\hat{\pi} = 1.6\%$, a $60\%$ deviation from the nominal rate. Kupiec would not reject, but a true exceedance rate that high would be economically significant.
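  • ES tests are implicitly about distribution shape. Passing VaR tests does not validate ES, because the tail beyond VaR is untested. Under Basel FRTB (post-2016), capital is based on $97.5\%$ ES while the regulatory backtest still counts VaR exceedances, so an ES-oriented backtest serves as a supplement rather than a replacement.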
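
The power claim in the first bullet can be checked exactly, without simulation, by summing binomial probabilities over the rejection region of the Kupiec test. The sketch below assumes independent exceedances, a nominal rate of $1\%$, a true rate of $2\%$ and the $5\%$ critical value $3.84$; function names are illustrative.

```python
import math


def _xlogy(x, y):
    """x * ln(y), with 0 * ln(0) taken as 0."""
    return 0.0 if x == 0 else x * math.log(y)


def kupiec_lr(n, t, alpha):
    pi_hat = n / t
    loglik_null = _xlogy(n, alpha) + _xlogy(t - n, 1 - alpha)
    loglik_alt = _xlogy(n, pi_hat) + _xlogy(t - n, 1 - pi_hat)
    return max(0.0, -2.0 * (loglik_null - loglik_alt))


def binom_pmf(k, t, p):
    return math.comb(t, k) * p**k * (1 - p)**(t - k)


def rejection_probability(t, alpha, true_rate, critical=3.84):
    """P(Kupiec rejects) when exceedances truly occur at true_rate."""
    return sum(binom_pmf(n, t, true_rate)
               for n in range(t + 1)
               if kupiec_lr(n, t, alpha) > critical)


for t in (250, 500, 1000):
    power = rejection_probability(t, alpha=0.01, true_rate=0.02)
    print(f"T={t}: power against a 2% true rate = {power:.2f}")

# Setting true_rate equal to alpha gives the actual size of the test,
# which differs from the nominal 5% because N is discrete.
size_250 = rejection_probability(250, alpha=0.01, true_rate=0.01)
```

Because $N$ takes only integer values, the rejection region is a set of counts rather than a smooth threshold, so both size and power move in jumps as $T$ changes.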

Worked example — Kupiec POF by hand

A bank runs a $1\%$-VaR model over $T = 250$ days and observes $N = 7$ exceedances.

Expected: $\mathbb{E}[N] = 2.5$. Empirical rate: $\hat{\pi} = 7/250 = 0.028$.

$$LR_{\text{POF}} = -2\ln\left[\frac{(0.01)^7(0.99)^{243}}{(0.028)^7(0.972)^{243}}\right].$$

Log of the numerator: $7\ln(0.01) + 243\ln(0.99) = 7(-4.6052) + 243(-0.01005) = -32.24 - 2.44 = -34.68$. Log of the denominator: $7\ln(0.028) + 243\ln(0.972) = 7(-3.5756) + 243(-0.02840) = -25.03 - 6.90 = -31.93$.

$$LR_{\text{POF}} = -2(-34.68 - (-31.93)) = -2(-2.75) = 5.50.$$

At $\chi^2_{1, 0.95} = 3.84$: reject at $5\%$. At $\chi^2_{1, 0.99} = 6.63$: do not reject at $1\%$.

Verdict. Marginally miscalibrated, with a $p$-value around $2\%$. The model likely underestimates risk, but the evidence is not overwhelming. Consistent with the Basel yellow zone ($5 \le N \le 9$).

Common confusions and pitfalls

  • Rolling vs fixed windows. Regulators use rolling 250-day windows; a model can look fine on a fixed calendar year and bad on a rolling window ending right after a volatility spike. Both views are legitimate.
  • Clean P&L. The P&L used for backtesting must exclude fees, new trades, and trading desks' intraday P&L: it must be the change in value of yesterday's static book under today's market moves. Mixing in intraday P&L distorts exceedance counts.
  • Multiple testing. If a bank runs 20 VaR models across desks and tests each at $5\%$, one is expected to fail by chance. Bonferroni correction or hierarchical testing is the statistically clean approach; most banks simply don't correct.
  • "No exceedances for a year" is suspicious. $N = 0$ in $250$ days at $1\%$ VaR has probability $0.99^{250} \approx 8\%$: not impossible, but worth investigating whether the model is systematically over-forecasting and wasting capital.
  • ES backtests need bootstrapping. The null distribution of $Z_2$ depends on the shape of the forecast distribution, not just on $\alpha$ and $\text{ES}$. Pulling a single critical value from a textbook won't work.
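
The multiple-testing arithmetic from the list above can be sketched in a few lines. The calculation assumes the 20 desk-level tests are independent, which real desks sharing risk factors will not satisfy exactly; function names are illustrative.

```python
def familywise_error(n_tests, level):
    """P(at least one false rejection) across n independent tests at the given level."""
    return 1.0 - (1.0 - level) ** n_tests


def bonferroni_level(n_tests, familywise_level):
    """Per-test level that keeps the family-wise error at or below the target."""
    return familywise_level / n_tests


n_models = 20
expected_false_fails = n_models * 0.05             # expected number failing by chance
fwer = familywise_error(n_models, 0.05)            # chance that at least one fails
per_test = bonferroni_level(n_models, 0.05)        # corrected per-model level

print(expected_false_fails, round(fwer, 3), per_test)
```

Testing each model at the corrected level lowers the false-alarm rate at the cost of power, which is already scarce at $T = 250$.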

Where this goes next

  • Coherent Risk Measures — why ES is the post-FRTB standard and why backtesting it is subtle.
  • Risk Management — the full operational picture around these tests.
  • Basel FRTB and traffic-light mechanics — regulatory extensions.
