CONTENTS

Importance Sampling

Motivation: why this matters in quant finance

Antithetic and control variates fail when the payoff is concentrated in rare events — deeply OTM options, default probabilities, extreme tail risks. For these problems, plain Monte Carlo wastes most of its samples in regions where the payoff is zero.

Importance sampling solves this by drawing from a different distribution — one that biases samples toward the payoff region — and reweighting via a likelihood ratio. The reweighting keeps the estimator unbiased; the bias of the sampling makes nearly every sample informative.
For OTM options, this can give 1000x to 10⁶x variance reduction — a difference between a 12-hour MC run and 30 seconds. It's the only Monte Carlo technique routinely used for credit risk (rare-default modelling), regulatory tail VaR, and exotic OTM options.

The informal idea

Suppose we want θ=Ep[f(X)]=f(x)p(x)dx\theta = \mathbb{E}_p[f(X)] = \int f(x)p(x)dx where pp is the true density.

Choose another density qq such that q(x)>0q(x) > 0 wherever p(x)f(x)0p(x)f(x) \ne 0. Then

θ=f(x)p(x)dx=f(x)p(x)q(x)q(x)dx=Eq ⁣[f(X)p(X)q(X)].\theta = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_q\!\left[f(X) \frac{p(X)}{q(X)}\right].
So sample XqX \sim q instead, and weight each sample by w(X)=p(X)/q(X)w(X) = p(X)/q(X) (the likelihood ratio). Estimator:
θ^IS=1Ni=1Nf(Xi)w(Xi),Xiq.\hat\theta_{IS} = \frac{1}{N}\sum_{i=1}^N f(X_i) w(X_i), \quad X_i \sim q.

This is unbiased. Its variance is

Var(θ^IS)=1N(Eq[f2w2]θ2).\text{Var}(\hat\theta_{IS}) = \frac{1}{N}\big(\mathbb{E}_q[f^2 w^2] - \theta^2\big).
The variance can be smaller — or much larger — than plain MC, depending on whether qq shifts mass toward or away from ff. The art is choosing qq well.

Optimal qq (the zero-variance limit)

For non-negative f0f \ge 0 (option payoffs), the optimal qq is
q(x)=f(x)p(x)θ.q^*(x) = \frac{f(x)p(x)}{\theta}.
Substituting: Var(θ^)=0\text{Var}(\hat\theta) = 0. The catch: the normalisation constant is θ\theta, the very thing we're computing. So qq^* is unusable directly. But it gives the design principle: qq should be proportional to fp|f| \cdot p.

In practice, choose qq in a parametric family, ideally with a closed-form likelihood ratio, and tune parameters to approximate the optimal shape.

Algorithm: Esscher transform for OTM call

Standard MC: draw ZN(0,1)Z \sim N(0, 1), transform to STS_T. The OTM call payoff is non-zero only when Z>z=(ln(K/S0)(rσ2/2)T)/(σT)Z > z^* = (\ln(K/S_0) - (r - \sigma^2/2)T)/(\sigma\sqrt{T}), which is a far right-tail event.

Idea: shift the distribution. Sample Z~N(μ,1)\tilde Z \sim N(\mu, 1) (drift the normal), then weight by
w(Z~)=ϕ(Z~)ϕ(Z~μ)=exp(μZ~μ2/2μZ~)eμ2/2=eμZ~+μ2/2.w(\tilde Z) = \frac{\phi(\tilde Z)}{\phi(\tilde Z - \mu)} = \exp(\mu \tilde Z - \mu^2/2 - \mu \tilde Z) \cdot e^{\mu^2/2} = e^{-\mu \tilde Z + \mu^2/2}.

Wait — the cleaner derivation: if Z~N(μ,1)\tilde Z \sim N(\mu, 1) and ZN(0,1)Z \sim N(0, 1),

p(z)q(z)=ϕ(z)ϕ(zμ)=exp ⁣(z22+(zμ)22)=exp(μz+μ2/2).\frac{p(z)}{q(z)} = \frac{\phi(z)}{\phi(z - \mu)} = \exp\!\left(-\frac{z^2}{2} + \frac{(z - \mu)^2}{2}\right) = \exp(-\mu z + \mu^2/2).

So the IS estimator is

C^IS=erT1Ni(S0e(rσ2/2)T+σTZ~iK)+eμZ~i+μ2/2.\hat C_{IS} = e^{-rT}\frac{1}{N}\sum_i (S_0 e^{(r-\sigma^2/2)T + \sigma\sqrt{T}\tilde Z_i} - K)^+ \cdot e^{-\mu \tilde Z_i + \mu^2/2}.

Choosing μ\mu to centre Z~\tilde Z around the payoff region (μz\mu \approx z^*) makes most samples ITM, and the variance plummets.

import numpy as np from scipy.stats import norm S0, K, T, r, sigma, N = 100, 200, 1, 0.05, 0.2, 100_000 # Plain MC rng = np.random.default_rng(42) Z = rng.standard_normal(N) ST = S0 * np.exp((r - 0.5*sigma**2)*T + sigma*np.sqrt(T)*Z) X_plain = np.exp(-r*T) * np.maximum(ST - K, 0) print(f"Plain: {X_plain.mean():.5f} ± {1.96*X_plain.std(ddof=1)/np.sqrt(N):.5f}") # Plain: 0.00043 ± 0.00029 -- very noisy # IS with mu = z* z_star = (np.log(K/S0) - (r - 0.5*sigma**2)*T)/(sigma*np.sqrt(T)) mu = z_star # shift to bring most samples to the strike print(f"z* = {z_star:.3f}, mu = {mu:.3f}") # z* = 3.218, mu = 3.218 Z_tilde = mu + rng.standard_normal(N) ST_is = S0 * np.exp((r - 0.5*sigma**2)*T + sigma*np.sqrt(T)*Z_tilde) weights = np.exp(-mu*Z_tilde + 0.5*mu**2) X_is = np.exp(-r*T) * np.maximum(ST_is - K, 0) * weights print(f"IS: {X_is.mean():.5f} ± {1.96*X_is.std(ddof=1)/np.sqrt(N):.5f}") # IS: 0.00060 ± 0.00001

A 30×30\times standard error reduction; equivalent to 1000×\sim 1000\times samples. For deeper OTM strikes, the gain grows further.

Key properties

  • Unbiased. For any qq with appropriate support, θ^IS\hat\theta_{IS} has Eq[θ^IS]=θ\mathbb{E}_q[\hat\theta_{IS}] = \theta.
  • Variance can be much smaller — or much larger. A bad qq (e.g., shift in the wrong direction) can make IS far worse than plain MC. There's a self-correcting check: if some samples have weights >> thousands, you've shifted too far.
  • Likelihood-ratio explosion. When qq has lighter tails than pp, the likelihood ratio w=p/qw = p/q has unbounded variance — IS estimator may have infinite variance even with finite θ\theta. Diagnostic: check that ww has bounded variance via Eq[w2]=Ep[w]\mathbb{E}_q[w^2] = \mathbb{E}_p[w] being finite.
  • Effective sample size (ESS). Neff=(wi)2/wi2N_{\text{eff}} = (\sum w_i)^2 / \sum w_i^2. ESS N\approx N means good IS; ESS N\ll N means weights are concentrated on a few samples — bad sign.
  • Optimal vs heuristic. Esscher (exponential tilting) is optimal for log-concave payoffs. For more complex payoffs, parametric families with adaptive tuning (cross-entropy method) are used.
  • Stacks with other techniques. IS + control variates is common in default modelling. IS + QMC requires careful re-randomisation.

Choosing qq: rules of thumb

  1. Concentrate qq near the high-payoff region. For an OTM call, shift the normal mean toward zz^*.
  2. Don't over-shift. Setting μz\mu \gg z^* pushes too far; almost all samples are deep ITM with tiny weights, ESS collapses.
  3. Match tails. If qq has lighter tails than pp, weights diverge. Common safe choice: qq in the same family as pp, just with different parameters.
  4. Adaptive tuning. Run a small pilot, estimate the optimal qq parameters via cross-entropy minimisation, then run the main pass.
  5. For multi-dimensional problems: shift each dimension separately. The product of likelihood ratios across dimensions becomes the joint LR.

Worked example: default probability

Loss L=1{X<B}L = \mathbb{1}_{\{X < -B\}} for some asset XN(0,σ)X \sim N(0, \sigma), large barrier BB. True probability: p=Φ(B/σ)p = \Phi(-B/\sigma).

For B/σ=3B/\sigma = 3: p0.00135p \approx 0.00135. Plain MC needs 107\sim 10^7 samples for 5%\sim 5\% relative precision.

IS with shift: sample X~N(B,σ2)\tilde X \sim N(-B, \sigma^2) (centre on the barrier). Weight: exp(BX/σ2B2/(2σ2))\exp(B X / \sigma^2 - B^2/(2\sigma^2)). Now half the samples are below the barrier; ESS is large; variance is reduced by 104\sim 10^4.

Used in credit risk (basket default models), regulatory ES estimation, and rare-event simulation generally.

Common confusions and pitfalls

  • Forgetting the weights. Easy mistake: sample from qq but compute the plain mean of f(X)f(X). Wrong — must multiply by w(X)=p(X)/q(X)w(X) = p(X)/q(X).
  • Numerical instability of weights. Compute logw\log w and stabilise: logw=μz+μ2/2\log w = -\mu z + \mu^2/2 (no exponentiation until the final weighted sum). For multi-dim, log-sum-exp tricks.
  • Too-aggressive tilting. Pushes most weight onto a few samples, reduces ESS, increases variance.
  • Wrong direction. Shifting away from the payoff region makes things worse than plain MC. Always sanity-check on a small pilot.
  • Path-dependent IS is harder. For an Asian or barrier option, the natural likelihood ratio is over the entire path, with an exponential of an Itô integral — the Girsanov theorem is the analogue of the Esscher transform here. Computationally, this is a path-wise weighting that's a product over time steps; each step needs to be log-summed correctly.

Where this goes next

  • Quasi-Monte Carlo — orthogonal variance-reduction strategy.
  • Change of measure — the underlying mechanism (Girsanov) for path-IS.
  • Cross-entropy method, adaptive IS — extensions for hard-to-tune problems.

Exercises

Test your understanding with 3 exercises for this lesson.