Importance Sampling

Motivation: why this matters in quant finance

Antithetic and control variates help little when the payoff is concentrated in rare events: deeply OTM options, default probabilities, extreme tail risks. For these problems, plain Monte Carlo wastes most of its samples in regions where the payoff is zero.

Importance sampling solves this by drawing from a different distribution — one that biases samples toward the payoff region — and reweighting via a likelihood ratio. The reweighting keeps the estimator unbiased; the bias of the sampling makes nearly every sample informative.

For deep OTM options, the variance reduction can reach $10^3\times$ to $10^6\times$, the difference between a 12-hour MC run and one that finishes in seconds. It is a standard Monte Carlo technique for credit risk (rare-default modelling), regulatory tail VaR, and exotic OTM options.

The informal idea

Suppose we want $\theta = \mathbb{E}_p[f(X)] = \int f(x)p(x)dx$ where $p$ is the true density.

Choose another density $q$ such that $q(x) > 0$ wherever $p(x)f(x) \ne 0$ . Then

\theta = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_q\!\left[f(X) \frac{p(X)}{q(X)}\right].

So sample $X \sim q$ instead, and weight each sample by the likelihood ratio $w(X) = p(X)/q(X)$. The estimator is

\hat\theta_{IS} = \frac{1}{N}\sum_{i=1}^N f(X_i) w(X_i), \quad X_i \sim q.

This is unbiased. Its variance is

\text{Var}(\hat\theta_{IS}) = \frac{1}{N}\big(\mathbb{E}_q[f^2 w^2] - \theta^2\big).

The variance can be smaller, or much larger, than that of plain MC, depending on whether $q$ shifts mass toward or away from the region where $f$ is large. The art is choosing $q$ well.
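To make the estimator concrete, here is a minimal sketch for the tail probability $\theta = P(Z > 4)$ with $p = N(0, 1)$ and proposal $q = N(4, 1)$. The target, the proposal, and the helper names `normal_logpdf` and `importance_sampling` are illustrative assumptions, not part of the text above.

```python
import math
import numpy as np

def normal_logpdf(x, mean=0.0):
    """Log density of N(mean, 1)."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)

def importance_sampling(f, log_p, log_q, sample_q, n, rng):
    """Return the IS estimate of E_p[f(X)] and its standard error."""
    x = sample_q(n, rng)                       # X_i ~ q
    w = np.exp(log_p(x) - log_q(x))            # likelihood ratio p(X_i) / q(X_i)
    terms = f(x) * w                           # f(X_i) * w(X_i)
    return terms.mean(), terms.std(ddof=1) / math.sqrt(n)

rng = np.random.default_rng(0)
c, n = 4.0, 100_000

est, se = importance_sampling(
    f=lambda x: (x > c).astype(float),                   # indicator payoff
    log_p=lambda x: normal_logpdf(x),                    # target N(0, 1)
    log_q=lambda x: normal_logpdf(x, mean=c),            # proposal N(c, 1)
    sample_q=lambda m, g: c + g.standard_normal(m),
    n=n,
    rng=rng,
)
exact = 0.5 * math.erfc(c / math.sqrt(2.0))    # P(Z > c) for Z ~ N(0, 1)
print(f"IS estimate: {est:.3e} +/- {1.96 * se:.1e}, exact: {exact:.3e}")
```

With plain MC and the same $N$, only about three draws would be expected to exceed 4, so the plain estimate would be almost pure noise.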

Optimal $q$ (the zero-variance limit)

For non-negative $f \ge 0$ (option payoffs), the optimal $q$ is

q^*(x) = \frac{f(x)p(x)}{\theta}.

Substituting gives $\text{Var}(\hat\theta) = 0$. The catch: the normalisation constant is $\theta$, the very thing we're computing. So $q^*$ is unusable directly. But it gives the design principle: $q$ should be approximately proportional to $|f| \cdot p$ .

In practice, choose $q$ in a parametric family, ideally with a closed-form likelihood ratio, and tune parameters to approximate the optimal shape.
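The zero-variance claim follows in one line. As a sketch in the notation above, writing $w^* = p/q^*$:

```latex
f(x)\,w^*(x) = f(x)\,\frac{p(x)}{q^*(x)}
             = f(x)\,\frac{\theta\,p(x)}{f(x)\,p(x)}
             = \theta
\quad \text{wherever } f(x)\,p(x) > 0,
\qquad\Longrightarrow\qquad
\mathbb{E}_{q^*}\!\left[f^2 w^{*2}\right] = \theta^2,
\quad
\text{Var}(\hat\theta_{IS}) = \frac{1}{N}\big(\theta^2 - \theta^2\big) = 0.
```

Every draw from $q^*$ returns exactly $\theta$, which is why a $q$ that only approximates the shape of $f \cdot p$ can still remove most of the variance.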

Algorithm: Esscher transform for OTM call

Standard MC: draw $Z \sim N(0, 1)$ , transform to $S_T$ . The OTM call payoff is non-zero only when $Z > z^* = (\ln(K/S_0) - (r - \sigma^2/2)T)/(\sigma\sqrt{T})$ , which is a far right-tail event.

Idea: shift the distribution. Sample $\tilde Z \sim N(\mu, 1)$ (a normal with shifted mean), then weight by the likelihood ratio. With $p(z) = \phi(z)$ the $N(0, 1)$ density and $q(z) = \phi(z - \mu)$ the $N(\mu, 1)$ density,

\frac{p(z)}{q(z)} = \frac{\phi(z)}{\phi(z - \mu)} = \exp\!\left(-\frac{z^2}{2} + \frac{(z - \mu)^2}{2}\right) = \exp(-\mu z + \mu^2/2),

so $w(\tilde Z) = e^{-\mu \tilde Z + \mu^2/2}$.

So the IS estimator is

\hat C_{IS} = e^{-rT}\frac{1}{N}\sum_i (S_0 e^{(r-\sigma^2/2)T + \sigma\sqrt{T}\tilde Z_i} - K)^+ \cdot e^{-\mu \tilde Z_i + \mu^2/2}.

Choosing $\mu$ to centre $\tilde Z$ around the payoff region ( $\mu \approx z^*$ ) puts about half the samples ITM, and the variance plummets.

import numpy as np
from scipy.stats import norm

S0, K, T, r, sigma, N = 100, 200, 1, 0.05, 0.2, 100_000

# Black-Scholes closed form, as a reference value
d1 = (np.log(S0/K) + (r + 0.5*sigma**2)*T)/(sigma*np.sqrt(T))
d2 = d1 - sigma*np.sqrt(T)
C_bs = S0*norm.cdf(d1) - K*np.exp(-r*T)*norm.cdf(d2)
print(f"BS:    {C_bs:.5f}")
# BS is approximately 0.0048

# Plain MC
rng = np.random.default_rng(42)
Z = rng.standard_normal(N)
ST = S0 * np.exp((r - 0.5*sigma**2)*T + sigma*np.sqrt(T)*Z)
X_plain = np.exp(-r*T) * np.maximum(ST - K, 0)
print(f"Plain: {X_plain.mean():.5f} ± {1.96*X_plain.std(ddof=1)/np.sqrt(N):.5f}")
# Only about 0.05% of paths finish ITM (roughly 50 of 100,000), so this interval is very wide

# IS with mu = z*
z_star = (np.log(K/S0) - (r - 0.5*sigma**2)*T)/(sigma*np.sqrt(T))
mu = z_star  # centre the sampling distribution on the strike
print(f"z* = {z_star:.3f}, mu = {mu:.3f}")
# z* = 3.316, mu = 3.316

Z_tilde = mu + rng.standard_normal(N)  # Z_tilde ~ N(mu, 1)
ST_is = S0 * np.exp((r - 0.5*sigma**2)*T + sigma*np.sqrt(T)*Z_tilde)
weights = np.exp(-mu*Z_tilde + 0.5*mu**2)  # likelihood ratio phi(z)/phi(z - mu)
X_is = np.exp(-r*T) * np.maximum(ST_is - K, 0) * weights
print(f"IS:    {X_is.mean():.5f} ± {1.96*X_is.std(ddof=1)/np.sqrt(N):.5f}")
# About half the paths now finish ITM; the interval is far narrower than for plain MC

The standard error falls by a factor of several tens, equivalent to on the order of $1000\times$ as many plain MC samples, since the standard error scales as $1/\sqrt{N}$. For deeper OTM strikes, the gain grows further.
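How the shift affects the error can be checked by sweeping $\mu$. The sketch below reuses the example's parameters; the helper names `norm_cdf` and `is_call` and the grid of $\mu$ values are illustrative choices, and the same seed is reused for every $\mu$ so that the runs are comparable.

```python
import math
import numpy as np

S0, K, T, r, sigma = 100.0, 200.0, 1.0, 0.05, 0.2

def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def is_call(mu, n=100_000, seed=0):
    """IS estimate and standard error of the call, sampling Z ~ N(mu, 1)."""
    rng = np.random.default_rng(seed)
    z = mu + rng.standard_normal(n)
    log_w = -mu * z + 0.5 * mu**2              # log of phi(z) / phi(z - mu)
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    x = math.exp(-r * T) * np.maximum(ST - K, 0.0) * np.exp(log_w)
    return x.mean(), x.std(ddof=1) / math.sqrt(n)

# Black-Scholes reference price
d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
d2 = d1 - sigma * math.sqrt(T)
bs_price = S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

z_star = (math.log(K / S0) - (r - 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))

results = {}
for mu in [0.0, 1.0, 2.0, z_star, 4.0, 6.0]:   # mu = 0 is plain MC
    results[mu] = is_call(mu)
    est, se = results[mu]
    print(f"mu = {mu:5.3f}: estimate = {est:.5f}, std error = {se:.2e}")
print(f"Black-Scholes reference: {bs_price:.5f}")
```

The standard error should be smallest for $\mu$ somewhat above $z^*$ and grow again as the shift becomes too aggressive.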

Key properties

Unbiased. For any $q$ with appropriate support, $\hat\theta_{IS}$ has $\mathbb{E}_q[\hat\theta_{IS}] = \theta$ .
Variance can be much smaller, or much larger. A bad $q$ (e.g., a shift in the wrong direction) can make IS far worse than plain MC. A quick diagnostic: if a handful of samples with non-zero payoff carry weights orders of magnitude larger than the rest, $q$ is poorly matched (for instance, shifted too far).
Likelihood-ratio explosion. When $q$ has lighter tails than $p$ , the likelihood ratio $w = p/q$ can have infinite variance, so the IS estimator may have infinite variance even with finite $\theta$ . Diagnostic: check that $\mathbb{E}_q[w^2] = \mathbb{E}_p[w]$ is finite.
Effective sample size (ESS). $N_{\text{eff}} = (\sum w_i)^2 / \sum w_i^2$ . ESS $\approx N$ means the weights are nearly uniform; ESS $\ll N$ means they are concentrated on a few samples, which is a warning sign. For rare-event shifts, read it with care: large weights that fall where $f = 0$ lower the raw ESS without hurting the estimator, so an ESS computed from $f(X_i)\,w(X_i)$ can be more informative.
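Optimal vs heuristic. Exponential tilting (the Esscher transform) is asymptotically optimal under suitable conditions, such as log-concave payoffs, but it is not the zero-variance $q^*$ . For more complex payoffs, parametric families with adaptive tuning (cross-entropy method) are used.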
Stacks with other techniques. IS + control variates is common in default modelling. IS + QMC requires careful re-randomisation.
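The ESS formula can be computed directly from the weights. The sketch below uses the mean-shifted normal proposal $N(\mu, 1)$ with $w = e^{-\mu z + \mu^2/2}$ ; the helper name `ess` and the grid of shifts are illustrative. For this family $\mathbb{E}_q[w^2] = e^{\mu^2}$ , so the raw-weight ESS is roughly $N e^{-\mu^2}$ .

```python
import numpy as np

def ess(weights):
    """Effective sample size: (sum w)^2 / sum w^2."""
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / np.sum(weights**2)

rng = np.random.default_rng(1)
n = 100_000
ess_by_mu = {}
for mu in [0.0, 0.5, 1.0, 2.0]:
    z = mu + rng.standard_normal(n)            # z ~ N(mu, 1)
    w = np.exp(-mu * z + 0.5 * mu**2)          # likelihood ratio against N(0, 1)
    ess_by_mu[mu] = ess(w)
    print(f"mu = {mu:.1f}: ESS = {ess_by_mu[mu]:,.0f} of {n:,}, max weight = {w.max():.1f}")
```

This raw-weight ESS ignores the payoff. For a rare-event shift it can look poor even when the estimate is accurate, because the largest weights sit where the payoff is zero.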

Choosing $q$ : rules of thumb

Concentrate $q$ near the high-payoff region. For an OTM call, shift the normal mean toward $z^*$ .
Don't over-shift. Setting $\mu \gg z^*$ pushes too far: almost all samples land deep ITM with tiny weights, the few samples near the strike dominate the sum, and ESS collapses.
Match tails. If $q$ has lighter tails than $p$ , weights diverge. Common safe choice: $q$ in the same family as $p$ , just with different parameters.
Adaptive tuning. Run a small pilot, estimate the optimal $q$ parameters via cross-entropy minimisation, then run the main pass.
For multi-dimensional problems with independent components, shift each dimension separately; the joint likelihood ratio is then the product of the per-dimension ratios.
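The pilot-then-tune rule can be sketched for the mean-shifted normal family. For $q = N(\mu, 1)$ the cross-entropy update has a closed form, the $f \cdot w$ -weighted mean of the samples. The helper names `discounted_payoff` and `ce_update`, the pilot size, and the choice to start from $\mu = z^*$ are assumptions made for illustration.

```python
import numpy as np

S0, K, T, r, sigma = 100.0, 200.0, 1.0, 0.05, 0.2
z_star = (np.log(K / S0) - (r - 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))

def discounted_payoff(z):
    """Discounted call payoff as a function of the driving normal z."""
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * np.maximum(ST - K, 0.0)

def ce_update(mu, n, rng):
    """One cross-entropy step for the mean of a N(mu, 1) proposal."""
    z = mu + rng.standard_normal(n)            # pilot draws from the current q
    w = np.exp(-mu * z + 0.5 * mu**2)          # likelihood ratio p / q
    fw = discounted_payoff(z) * w
    return np.sum(fw * z) / np.sum(fw)         # estimates E_p[f Z] / E_p[f]

rng = np.random.default_rng(7)
mu = z_star                                    # heuristic starting point
for _ in range(3):                             # a few small pilot runs
    mu = ce_update(mu, 20_000, rng)
    print(f"updated mu = {mu:.3f}")
```

The tuned shift lands somewhat above $z^*$ , since the payoff grows with distance past the strike. With independent normal components, the same update applies to each coordinate of the mean vector.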

Worked example: default probability

Loss $L = \mathbb{1}_{\{X < -B\}}$ for some asset $X \sim N(0, \sigma^2)$ , large barrier $B$ . True probability: $\theta = \Phi(-B/\sigma)$ .

For $B/\sigma = 3$ : $\theta \approx 0.00135$ . Plain MC has relative standard error $\sqrt{(1-\theta)/(\theta N)}$ , so $\sim 5\%$ relative precision needs $N \approx 3 \times 10^5$ samples, and $1\%$ needs $\sim 10^7$ .

IS with shift: sample $\tilde X \sim N(-B, \sigma^2)$ (centre on the barrier). Weight: $w(\tilde X) = \exp(B \tilde X / \sigma^2 + B^2/(2\sigma^2))$ . Now half the samples are below the barrier, each of them carries a weight of at most $e^{-B^2/(2\sigma^2)}$ , and the per-sample variance falls by a factor of roughly $200$ .

Used in credit risk (basket default models), regulatory ES estimation, and rare-event simulation generally.
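The default-probability example can be run as follows. This is a minimal sketch with $B = 3$ and $\sigma = 1$ ; the seed and sample size are arbitrary choices.

```python
import math
import numpy as np

B, sigma, n = 3.0, 1.0, 100_000
rng = np.random.default_rng(3)
exact = 0.5 * math.erfc(B / (sigma * math.sqrt(2.0)))   # Phi(-B / sigma)

# Plain MC: X ~ N(0, sigma^2)
x = sigma * rng.standard_normal(n)
plain_terms = (x < -B).astype(float)

# IS: X ~ N(-B, sigma^2), with log weight B x / sigma^2 + B^2 / (2 sigma^2)
x_is = -B + sigma * rng.standard_normal(n)
log_w = B * x_is / sigma**2 + B**2 / (2.0 * sigma**2)
is_terms = (x_is < -B) * np.exp(log_w)

se_plain = plain_terms.std(ddof=1) / math.sqrt(n)
se_is = is_terms.std(ddof=1) / math.sqrt(n)
print(f"exact: {exact:.6f}")
print(f"plain: {plain_terms.mean():.6f} +/- {1.96 * se_plain:.6f}")
print(f"IS:    {is_terms.mean():.6f} +/- {1.96 * se_is:.6f}")
print(f"variance ratio (plain / IS): {(se_plain / se_is) ** 2:.0f}")
```

The printed variance ratio is itself a noisy sample estimate; the theoretical value for $B/\sigma = 3$ is on the order of a couple of hundred.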

Common confusions and pitfalls

Forgetting the weights. Easy mistake: sample from $q$ but compute the plain mean of $f(X)$ . Wrong — must multiply by $w(X) = p(X)/q(X)$ .
Numerical instability of weights. Work with $\log w = -\mu z + \mu^2/2$ and exponentiate only when forming the final weighted sum. In many dimensions, add the per-dimension $\log w$ terms, and use a log-sum-exp when the weights themselves must be summed or normalised.
Too-aggressive tilting. Pushes most weight onto a few samples, reduces ESS, increases variance.
Wrong direction. Shifting away from the payoff region makes things worse than plain MC. Always sanity-check on a small pilot.
Path-dependent IS is harder. For an Asian or barrier option, the likelihood ratio is over the entire path and involves an exponential of an Itô integral; the Girsanov theorem is the continuous-time analogue of the Esscher transform. In a discretised simulation this becomes a product of per-step ratios, so accumulate the per-step $\log w$ terms along each path and exponentiate once at the end.
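The path-wise weighting can be sketched for a discretely monitored arithmetic Asian call. Every step's normal draw is shifted by the same amount `theta`, and the log likelihood ratio is the sum of the per-step terms. The strike, the step count, the constant shift `theta = 0.2`, and the helper name `asian_call_terms` are hypothetical choices; a tuned shift would generally vary by time step.

```python
import numpy as np

S0, K, T, r, sigma = 100.0, 130.0, 1.0, 0.05, 0.2
n_steps, n_paths = 50, 50_000
dt = T / n_steps
theta = 0.2                                    # hypothetical constant per-step shift

def asian_call_terms(shift, rng):
    """Per-path discounted payoff times path weight, sampling Z_k ~ N(shift, 1)."""
    z = shift + rng.standard_normal((n_paths, n_steps))
    # Path log likelihood ratio: sum of per-step log ratios, exponentiated once
    log_w = np.sum(-shift * z + 0.5 * shift**2, axis=1)
    increments = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    log_S = np.log(S0) + np.cumsum(increments, axis=1)
    avg = np.exp(log_S).mean(axis=1)           # arithmetic average over monitoring dates
    return np.exp(-r * T) * np.maximum(avg - K, 0.0) * np.exp(log_w)

rng = np.random.default_rng(11)
plain_terms = asian_call_terms(0.0, rng)       # shift 0 gives plain MC (weights are 1)
is_terms = asian_call_terms(theta, rng)

plain_est, se_plain = plain_terms.mean(), plain_terms.std(ddof=1) / np.sqrt(n_paths)
is_est, se_is = is_terms.mean(), is_terms.std(ddof=1) / np.sqrt(n_paths)
print(f"plain: {plain_est:.5f} +/- {1.96 * se_plain:.5f}")
print(f"IS:    {is_est:.5f} +/- {1.96 * se_is:.5f}")
```

The two estimates should agree within their error bars, with the IS interval the narrower of the two.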

Where this goes next

Quasi-Monte Carlo — orthogonal variance-reduction strategy.
Change of measure — the underlying mechanism (Girsanov) for path-IS.
Cross-entropy method, adaptive IS — extensions for hard-to-tune problems.