Adam Optimizer
Motivation: why this matters in quant finance
Adam is the default optimiser for many neural-network and machine-learning workflows because it adapts the step size coordinate by coordinate. In quant work, this matters when one parameter controls a slow-moving macro feature and another controls a high-variance intraday signal. A single global learning rate can be too small for one and too large for the other.
Adam is not magic. It is a carefully engineered combination of momentum and RMS-style normalisation. Understanding the update helps you diagnose unstable training rather than treating the optimiser as a black box.
The informal idea
Adam keeps two running summaries of past gradients:
- A first moment $m_t$, the exponentially weighted average gradient. This is momentum: it smooths noisy updates.
- A second moment $v_t$, the exponentially weighted average squared gradient. This estimates coordinate-wise scale.
The update moves in the direction $m_t$, but divides by $\sqrt{v_t}$ so coordinates with historically large gradients get smaller steps.
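The two running summaries can be sketched in a few lines. The snippet below is a minimal illustration, not a full optimiser: it omits bias correction and the small stabilising constant, and the function name `running_moments` and the gradient values are made up for the example. It shows that the ratio of the first moment to the square root of the second moment does not change when every gradient is rescaled, which is why coordinates with very different gradient scales end up with comparable step sizes.

```python
import numpy as np

def running_moments(grads, beta1=0.9, beta2=0.999):
    """Return the uncorrected (m, v) after consuming a sequence of scalar gradients."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g        # exponentially weighted mean gradient
        v = beta2 * v + (1.0 - beta2) * g * g    # exponentially weighted mean squared gradient
    return m, v

# Hypothetical gradient history for one coordinate, and the same history scaled by 100.
grads = np.array([0.5, -0.2, 0.4, 0.3, -0.1])
m_small, v_small = running_moments(grads)
m_large, v_large = running_moments(100.0 * grads)

ratio_small = m_small / np.sqrt(v_small)
ratio_large = m_large / np.sqrt(v_large)
print(ratio_small, ratio_large)  # the two ratios agree up to rounding
```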
Formal definition
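Given parameters $\theta_t$, stochastic gradient $g_t$, learning rate $\alpha$, and decay parameters $\beta_1, \beta_2 \in [0, 1)$:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
with elementwise square. Because $m_0 = v_0 = 0$, the early estimates are biased toward zero. Adam corrects this:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$
The parameter update is
$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.$$
Typical defaults are $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.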
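As a rough sketch, the definition translates into a single vectorised step in NumPy. The function name `adam_step` and the example gradient are illustrative choices, not part of any library API. On the very first step the bias-corrected moments equal the gradient and its square, so each coordinate moves by almost exactly the learning rate in the direction opposite to its gradient, whatever the gradient's size.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter. Returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2      # elementwise square
    m_hat = m / (1.0 - beta1**t)                 # bias-corrected first moment
    v_hat = v / (1.0 - beta2**t)                 # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.5, -40.0])                    # hypothetical gradient with very different scales

theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)  # each coordinate has moved by about alpha = 0.001
```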
Key properties
Momentum reduces gradient noise
The first moment averages recent gradients, so a single noisy minibatch does not completely determine the update. This is useful in financial data, where labels are noisy and signal-to-noise ratios are low.
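The smoothing effect can be checked on synthetic data. The sketch below assumes independent Gaussian noise around a constant true gradient, which is a simplification; the values of `true_grad` and `noise_std` are arbitrary. For independent noise, the steady-state standard deviation of the exponentially weighted average is $\sqrt{(1-\beta_1)/(1+\beta_1)}$ times that of the raw gradients, about 0.23 for $\beta_1 = 0.9$.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad, noise_std, beta1 = 1.0, 5.0, 0.9      # illustrative values only
raw = true_grad + noise_std * rng.standard_normal(5000)

m = 0.0
smoothed = np.empty_like(raw)
for t, g in enumerate(raw, start=1):
    m = beta1 * m + (1.0 - beta1) * g
    smoothed[t - 1] = m / (1.0 - beta1**t)       # bias-corrected first moment

raw_std = raw.std()
smoothed_std = smoothed[100:].std()              # skip a short warm-up period
print(f"raw std: {raw_std:.2f}, smoothed std: {smoothed_std:.2f}")
```

The price of the lower variance is lag: the average reacts to a genuine change in the gradient over roughly $1/(1-\beta_1)$ steps.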
Coordinate-wise scaling helps ill-conditioned losses
Dividing by $\sqrt{\hat{v}_t}$ gives smaller steps in directions with large gradient variance. This is why Adam often trains faster than plain SGD on models with heterogeneous feature scales.
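A small experiment makes the point concrete. The sketch below uses a hypothetical separable quadratic $f(x, y) = \tfrac{1}{2}(100x^2 + y^2)$, whose two coordinates differ in curvature by a factor of 100. Plain gradient descent has to use a learning rate small enough for the steep coordinate, so the flat coordinate barely moves. Because Adam's step depends on the gradient only through a scale-free ratio, both coordinates follow essentially the same path when started from the same value. The learning rates are chosen for illustration, not tuned.

```python
import numpy as np

curvature = np.array([100.0, 1.0])               # f(x, y) = 0.5 * (100 x^2 + y^2)

def grad_f(theta):
    return curvature * theta

def run_gd(theta0, lr, n_steps):
    theta = theta0.copy()
    for _ in range(n_steps):
        theta = theta - lr * grad_f(theta)
    return theta

def run_adam(theta0, alpha, n_steps, beta1=0.9, beta2=0.999, eps=1e-8):
    theta = theta0.copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, n_steps + 1):
        g = grad_f(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

start = np.array([1.0, 1.0])
gd_final = run_gd(start, lr=0.01, n_steps=20)    # lr limited by the steep coordinate
adam_final = run_adam(start, alpha=0.05, n_steps=20)
print("GD  :", gd_final)
print("Adam:", adam_final)
```

After 20 steps the flat coordinate under gradient descent is still above 0.8, since it shrinks by only 1% per step, while Adam has moved both coordinates by the same amount.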
Bias correction matters early
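Both moment estimates start at zero, so without the correction they are biased toward zero during the first steps. The two biases do not cancel: with the default decay rates the first uncorrected step has size $\alpha\,(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.2\,\alpha$ instead of roughly $\alpha$, so the earliest updates are too large. The correction makes the early optimiser behaviour match the intended moving averages.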
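The effect on the very first step can be checked directly. The sketch below compares the first update with and without bias correction for a single scalar gradient; the gradient value is arbitrary and the variable names are illustrative. With the default decay rates the uncorrected step works out to $(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.16$ times the learning rate, whereas the corrected step is almost exactly the learning rate. Other choices of decay rates can push the uncorrected step in the opposite direction.

```python
import numpy as np

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
grad = 2.0                                       # arbitrary first gradient

# Moments after one update, starting from m = v = 0.
m = (1.0 - beta1) * grad
v = (1.0 - beta2) * grad**2

step_uncorrected = alpha * m / (np.sqrt(v) + eps)
step_corrected = alpha * (m / (1.0 - beta1)) / (np.sqrt(v / (1.0 - beta2)) + eps)

print(f"uncorrected / alpha: {step_uncorrected / alpha:.3f}")
print(f"corrected   / alpha: {step_corrected / alpha:.3f}")
```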
Adam still needs validation
Adaptive steps can overfit and can settle into sharper minima than SGD in some settings. For finance models, always monitor out-of-sample loss, turnover, and realised transaction costs rather than only training loss.
Worked example
The following compact implementation shows the mechanics on a one-dimensional quadratic loss.
```python
import numpy as np


def adam_minimise(theta0: float, n_steps: int = 8) -> float:
    """Minimise f(theta) = (theta - 3)^2 with Adam."""
    alpha, beta1, beta2, eps = 0.2, 0.9, 0.999, 1e-8
    theta = theta0
    m = 0.0  # first-moment estimate
    v = 0.0  # second-moment estimate
    for t in range(1, n_steps + 1):
        grad = 2.0 * (theta - 3.0)  # derivative of (theta - 3)^2
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad**2
        m_hat = m / (1.0 - beta1**t)  # bias-corrected moments
        v_hat = v / (1.0 - beta2**t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta


print(f"theta after 8 steps: {adam_minimise(0.0):.4f}")
# Prints a value near 1.56: eight steps, each slightly smaller than alpha = 0.2.
```

The point has moved steadily toward $\theta = 3$ in steps of nearly equal size, even though the initial gradient is much larger than later gradients. On higher-dimensional problems this same normalisation happens coordinate by coordinate.
Common confusions and pitfalls
Where this goes next
- Gradient Descent: The base update that Adam modifies with moments.
- Linear Regression: A simple model where gradient-based training can be compared with a closed-form solution.
- Quasi-Monte Carlo: Another example where variance reduction and numerical stability matter in quant computation.