Adam Optimizer

Motivation: why this matters in quant finance

Modern quant models are often trained by minimising a loss: forecast error for returns, negative log-likelihood for volatility models, reconstruction loss for autoencoders, or policy loss in reinforcement learning. Plain gradient descent can work, but it is sensitive to feature scaling, noisy gradients, and the learning-rate schedule.

Adam is the default optimiser for many neural-network and machine-learning workflows because it adapts the step size coordinate by coordinate. In quant work, this matters when one parameter controls a slow-moving macro feature and another controls a high-variance intraday signal. A single global learning rate can be too small for one and too large for the other.

Adam is not magic. It is a carefully engineered combination of momentum and RMS-style normalisation. Understanding the update helps you diagnose unstable training rather than treating the optimiser as a black box.

The informal idea

Adam keeps two running summaries of past gradients:

A first moment $m_t$, the exponentially weighted average gradient. This is momentum: it smooths noisy updates.
A second moment $v_t$, the exponentially weighted average squared gradient. This estimates coordinate-wise scale.

The update steps along $-m_t$, the smoothed descent direction, but divides elementwise by $\sqrt{v_t}$ so coordinates with historically large gradients get smaller steps.

Formal definition

Given parameters $\theta_t$, stochastic gradient $g_t=\nabla_\theta L_t(\theta_t)$, learning rate $\alpha$, and decay parameters $\beta_1,\beta_2\in(0,1)$:

$$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$$

$$v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$$

where $g_t^2$ is the elementwise square. Because $m_0=v_0=0$, the early estimates are biased toward zero. Adam corrects this:

$$\hat m_t=\frac{m_t}{1-\beta_1^t}, \qquad \hat v_t=\frac{v_t}{1-\beta_2^t}.$$
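To see where the factor $1-\beta_1^t$ comes from, unroll the recursion from $m_0=0$. The sketch below makes a simplifying assumption that is not part of the definition: the gradients share a common mean, written here as $\bar g$.

$$m_t=(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}g_i, \qquad \mathbb{E}[m_t]=\bar g\,(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}=\bar g\,(1-\beta_1^t).$$

Under that assumption the weights sum to $1-\beta_1^t$ rather than 1, so dividing by $1-\beta_1^t$ removes the shrinkage. The same argument with $g_i^2$ and $\beta_2$ gives the factor for $\hat v_t$.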

The parameter update is

$$\theta_{t+1}=\theta_t-\alpha\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}.$$

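Typical defaults are $\alpha=10^{-3}$, $\beta_1=0.9$, $\beta_2=0.999$, and $\epsilon=10^{-8}$, where $\epsilon$ is a small constant that keeps the division well defined when $\hat v_t$ is close to zero.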
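The equations above translate almost line by line into array code. The following is a minimal NumPy sketch, not a reference implementation: the function name `adam_step`, its argument order, and the toy quadratic loss with its `target` vector are all assumptions made for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for array-valued parameters; t counts from 1."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment moving average
    v = beta2 * v + (1.0 - beta2) * grad**2     # second-moment moving average (elementwise square)
    m_hat = m / (1.0 - beta1**t)                # bias correction
    v_hat = v / (1.0 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical usage: minimise sum((theta - target)**2) over a 3-vector.
target = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
for t in range(1, 2001):
    grad = 2.0 * (theta - target)               # gradient of the toy loss
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.01)
print(theta)
```

The optimiser state is just the pair `(m, v)` plus the step counter `t`, which is why checkpointing Adam means saving two extra arrays per parameter tensor.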

Key properties

Momentum reduces gradient noise

The first moment $m_t$ averages recent gradients, so a single noisy minibatch does not completely determine the update. This is useful in financial data, where labels are noisy and signal-to-noise ratios are low.
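A small simulation makes the smoothing visible. This is a sketch under stated assumptions: the "true" gradient is a hypothetical constant, the noise is i.i.d. Gaussian, and the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)                      # fixed seed, illustrative only
beta1 = 0.9
true_grad = 1.0                                     # hypothetical constant signal
g = true_grad + 5.0 * rng.standard_normal(5000)     # noisy minibatch gradients

m = 0.0
m_hat = np.empty_like(g)
for t, g_t in enumerate(g, start=1):
    m = beta1 * m + (1.0 - beta1) * g_t
    m_hat[t - 1] = m / (1.0 - beta1**t)             # bias-corrected first moment

raw_std = g.std()
smoothed_std = m_hat[100:].std()                    # skip the warm-up period
print(f"std of raw gradients:      {raw_std:.2f}")
print(f"std of smoothed gradients: {smoothed_std:.2f}")
```

For i.i.d. noise the steady-state standard deviation of the moving average is $\sqrt{(1-\beta_1)/(1+\beta_1)}$ times that of the raw gradient, about 0.23 for $\beta_1=0.9$. The price is lag: $m_t$ also reacts more slowly when the underlying gradient genuinely changes.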

Coordinate-wise scaling helps ill-conditioned losses

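Dividing by $\sqrt{\hat v_t}$ gives smaller steps in coordinates whose gradients have been large in magnitude. Because $\hat v_t$ tracks the uncentred second moment, both a large mean gradient and a large gradient variance shrink the step. This is why Adam often trains faster than plain SGD on models with heterogeneous feature scales.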
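The effect is easiest to see on the very first update. The sketch below uses a made-up two-coordinate gradient whose entries differ by a factor of a million and compares the first Adam step with a plain gradient step at the same learning rate.

```python
import numpy as np

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

# Hypothetical gradient: coordinate 0 has a tiny scale, coordinate 1 a huge one.
grad = np.array([0.002, -2000.0])

# First Adam step, starting from m_0 = v_0 = 0 (so t = 1).
m = (1.0 - beta1) * grad
v = (1.0 - beta2) * grad**2
m_hat = m / (1.0 - beta1)
v_hat = v / (1.0 - beta2)
adam_step = alpha * m_hat / (np.sqrt(v_hat) + eps)

sgd_step = alpha * grad          # plain gradient descent with the same alpha

print("Adam step:", adam_step)
print("SGD step: ", sgd_step)
```

Both Adam components have magnitude close to $\alpha$, while the plain gradient steps inherit the full scale difference. On later iterations the ratio depends on the gradient history, so the steps are no longer exactly $\alpha$, but they stay on a comparable scale across coordinates.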

Bias correction matters early

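Both moment estimates start at zero, so without the $(1-\beta^t)$ correction the early values of $m_t$ and $v_t$ are shrunk toward zero. With the default $\beta_2$ much closer to 1 than $\beta_1$, $v_t$ is shrunk more strongly, and the uncorrected ratio $m_t/\sqrt{v_t}$ is inflated rather than damped: the first step would be about $(1-\beta_1)/\sqrt{1-\beta_2}\approx 3.2$ times $\alpha$. The correction makes the early optimiser behaviour match the intended moving averages.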
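A quick numerical check of the first update, with and without the correction, under the default decay rates. With these defaults the uncorrected step comes out larger than $\alpha$, because $v_1$ is shrunk by a factor of $1-\beta_2$ and enters under a square root, while $m_1$ is shrunk only by $1-\beta_1$. The gradient value is arbitrary; any non-zero number gives the same ratios.

```python
import math

alpha, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
g = -6.0                                  # arbitrary non-zero first gradient

m1 = (1.0 - beta1) * g                    # m_1 starting from m_0 = 0
v1 = (1.0 - beta2) * g**2                 # v_1 starting from v_0 = 0

step_uncorrected = alpha * m1 / (math.sqrt(v1) + eps)
step_corrected = alpha * (m1 / (1.0 - beta1)) / (math.sqrt(v1 / (1.0 - beta2)) + eps)

print(f"uncorrected step / alpha: {abs(step_uncorrected) / alpha:.3f}")
print(f"corrected step / alpha:   {abs(step_corrected) / alpha:.3f}")
```

The corrected first step has magnitude $\alpha$ up to the effect of $\epsilon$; the uncorrected one is larger by the factor $(1-\beta_1)/\sqrt{1-\beta_2}$.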

Adam still needs validation

Adam can reach a low training loss quickly and still generalise worse than well-tuned SGD in some settings, for example by settling into sharper minima. For finance models, always monitor out-of-sample loss, turnover, and realised transaction costs rather than only training loss.

Worked example

The following compact implementation shows the mechanics on a one-dimensional quadratic loss.

```python
import numpy as np

def adam_minimise(theta0: float, n_steps: int = 8) -> float:
    """Minimise f(theta) = (theta - 3)^2 with Adam."""
    alpha, beta1, beta2, eps = 0.2, 0.9, 0.999, 1e-8
    theta = theta0
    m = 0.0  # first-moment estimate, m_0 = 0
    v = 0.0  # second-moment estimate, v_0 = 0
    for t in range(1, n_steps + 1):
        grad = 2.0 * (theta - 3.0)               # f'(theta)
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad**2
        m_hat = m / (1.0 - beta1**t)             # bias correction
        v_hat = v / (1.0 - beta2**t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print(f"theta after 8 steps: {adam_minimise(0.0):.4f}")
# Prints a value close to 1.56: every step moves theta by slightly less than alpha = 0.2.
```

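The point moves steadily toward $3$ in steps of slightly less than $\alpha=0.2$, even though the gradient shrinks from $-6$ to roughly $-3.2$ over the run. The ratio $\hat m_t/\sqrt{\hat v_t}$ stays close to 1 in magnitude, so the step size is set by $\alpha$ rather than by the raw gradient. On higher-dimensional problems this same normalisation happens coordinate by coordinate.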
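Re-running the same toy problem with a different learning rate shows how directly $\alpha$ sets the pace. The helper `adam_1d` below is a hypothetical compact variant of the loop above, written only for this comparison.

```python
def adam_1d(alpha, n_steps=8, theta=0.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on f(theta) = (theta - 3)^2, the same toy loss as the worked example."""
    m = v = 0.0
    for t in range(1, n_steps + 1):
        grad = 2.0 * (theta - 3.0)
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad**2
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        theta -= alpha * m_hat / (v_hat**0.5 + eps)
    return theta

for alpha in (1e-3, 0.2):
    print(f"alpha={alpha}: theta after 8 steps = {adam_1d(alpha):.4f}")
```

With the common default $\alpha=10^{-3}$ the eight steps together move $\theta$ by less than $0.008$, so on this problem covering the distance to the minimum would take on the order of $3/\alpha$ steps. The normalisation fixes the scale of each step, not the right value of $\alpha$.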

Common confusions and pitfalls

"Adam removes the need to tune the learning rate." No. Adam is less sensitive than plain gradient descent, but $\alpha$ still controls stability and generalisation.

"A larger $\beta_2$ is always better." A high $\beta_2$ gives a stable second-moment estimate but reacts slowly when the gradient scale changes. That can be a problem in non-stationary financial data.
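The lag can be sketched with a synthetic regime shift. Everything here is an illustrative assumption: the gradient magnitudes, the timing of the jump, and the helper name `v_hat_path`.

```python
def v_hat_path(grads, beta2):
    """Bias-corrected second-moment estimate after each gradient."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g**2
        out.append(v / (1.0 - beta2**t))
    return out

# Hypothetical regime shift: gradient magnitude jumps from 1 to 10 at step 501.
grads = [1.0] * 500 + [10.0] * 50

slow = v_hat_path(grads, beta2=0.999)[-1]
fast = v_hat_path(grads, beta2=0.99)[-1]
print("squared gradient after the shift: 100.0")
print(f"v_hat with beta2=0.999: {slow:.1f}")
print(f"v_hat with beta2=0.99:  {fast:.1f}")
```

Fifty steps after the jump both estimates are still below the new squared gradient of 100, and the one with $\beta_2=0.999$ is much further behind. While $\hat v_t$ underestimates the current scale, the normalised step is larger than intended, which is the risk after a volatility spike.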

"Lower training loss means a better trading model." Optimisation loss is only a proxy. A model can fit historical noise and fail after costs, slippage, and regime shifts.

"Adam is second-order optimisation." It uses squared gradients as a scale estimate, not the Hessian. It is adaptive first-order optimisation.

Where this goes next

Gradient Descent: The base update that Adam modifies with moments.
Linear Regression: A simple model where gradient-based training can be compared with a closed-form solution.
Quasi-Monte Carlo: Another example where variance reduction and numerical stability matter in quant computation.