Neural Networks from Scratch

Motivation: why this matters in quant finance

Neural networks are differentiable function approximators. In quant finance they appear in volatility-surface smoothing, nonlinear factor models, surrogate pricing functions, execution models, and regime classifiers. The library call is easy; the useful understanding is how affine maps, activations, losses, gradients, and updates fit together.

This lesson is intentionally from scratch. It does not replace Keras or PyTorch. It gives the NumPy-level mechanics so later library code feels inspectable rather than magical.

The informal idea

A dense neural network alternates linear transformations with nonlinear activations. The first layer builds hidden features. The output layer turns hidden features into predictions. Training repeats four steps: forward pass, loss computation, backward pass, parameter update.

Without nonlinear activations, stacked layers collapse to a single affine map, so the network is no more expressive than one Linear Regression model.
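
The collapse can be checked numerically. The sketch below is illustrative only: the shapes, the seed, and the variable names are assumptions, not part of the lesson's model. It composes two affine layers with no activation and compares the result with one affine layer built from the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the illustration
X = rng.normal(size=(5, 3))

# Two stacked affine layers with no activation in between.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)
stacked = (X @ W1 + b1) @ W2 + b2

# The same map written as a single affine layer.
W = W1 @ W2        # shape (3, 1)
b = b1 @ W2 + b2   # shape (1,)
single = X @ W + b

print(np.allclose(stacked, single))  # True

# With tanh between the layers the output is no longer an affine function of X.
nonlinear = np.tanh(X @ W1 + b1) @ W2 + b2
print(np.allclose(nonlinear, single))
```

The first comparison holds for any weights, because (XW1 + b1)W2 + b2 expands to X(W1W2) + (b1W2 + b2).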

Formal statement

For a one-hidden-layer regression network,

\begin{aligned} \mathbf{H} &= \tanh(\mathbf{X}\mathbf{W}_1+\mathbf{b}_1),\\ \hat{\mathbf{y}} &= \mathbf{H}\mathbf{W}_2 + b_2. \end{aligned}

With mean squared error,

L(\theta)=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2.

Backpropagation applies the chain rule from the loss back through the output layer, activation, and first affine map. Gradient Descent updates each parameter in the negative-gradient direction.
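
Written out for this network, the chain rule gives the gradients below. This is a sketch of the derivation under the definitions above. The pre-activation symbol Z_1, the learning rate \eta, and the all-ones vector \mathbf{1} are notation introduced here for convenience, and \odot denotes the elementwise product.

```latex
\begin{aligned}
\mathbf{Z}_1 &= \mathbf{X}\mathbf{W}_1 + \mathbf{b}_1, \qquad \mathbf{H} = \tanh(\mathbf{Z}_1),\\
\frac{\partial L}{\partial \hat{\mathbf{y}}} &= \frac{2}{n}\,(\hat{\mathbf{y}} - \mathbf{y}),\\
\frac{\partial L}{\partial \mathbf{W}_2} &= \mathbf{H}^\top \frac{\partial L}{\partial \hat{\mathbf{y}}}, \qquad
\frac{\partial L}{\partial b_2} = \mathbf{1}^\top \frac{\partial L}{\partial \hat{\mathbf{y}}},\\
\frac{\partial L}{\partial \mathbf{H}} &= \frac{\partial L}{\partial \hat{\mathbf{y}}}\,\mathbf{W}_2^\top,\\
\frac{\partial L}{\partial \mathbf{Z}_1} &= \frac{\partial L}{\partial \mathbf{H}} \odot \left(1 - \mathbf{H}\odot\mathbf{H}\right),\\
\frac{\partial L}{\partial \mathbf{W}_1} &= \mathbf{X}^\top \frac{\partial L}{\partial \mathbf{Z}_1}, \qquad
\frac{\partial L}{\partial \mathbf{b}_1} = \mathbf{1}^\top \frac{\partial L}{\partial \mathbf{Z}_1},\\
\theta &\leftarrow \theta - \eta\,\nabla_\theta L .
\end{aligned}
```

The factor 1 - \mathbf{H}\odot\mathbf{H} is the derivative of tanh evaluated at \mathbf{Z}_1, which is why the hidden activations can be reused in the backward pass.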

Implementation

import numpy as np

class TinyMLP:
    """One-hidden-layer neural network for regression."""
    def __init__(self, n_features: int, n_hidden: int = 8, learning_rate: float = 0.05, seed: int = 37):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.3, size=(n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.3, size=(n_hidden, 1))
        self.b2 = np.zeros(1)
        self.learning_rate = learning_rate

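    def fit(self, X: np.ndarray, y: np.ndarray, n_iter: int = 2_000):
        y = y.reshape(-1, 1)  # column vector so it lines up with y_hat
        for _ in range(n_iter):
            # Forward pass: hidden activations, then the linear output layer.
            H = np.tanh(X @ self.W1 + self.b1)
            y_hat = H @ self.W2 + self.b2
            # Backward pass: gradient of MSE with respect to the predictions.
            d_y_hat = 2 * (y_hat - y) / len(y)
            # Output-layer gradients.
            d_W2 = H.T @ d_y_hat
            d_b2 = d_y_hat.sum(axis=0)
            # Propagate through W2, then through tanh (derivative 1 - tanh^2).
            d_H = d_y_hat @ self.W2.T
            d_Z1 = d_H * (1 - H**2)
            # Gradient-descent update for every parameter.
            self.W2 -= self.learning_rate * d_W2
            self.b2 -= self.learning_rate * d_b2
            self.W1 -= self.learning_rate * (X.T @ d_Z1)
            self.b1 -= self.learning_rate * d_Z1.sum(axis=0)
        return self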

    def predict(self, X: np.ndarray) -> np.ndarray:
        H = np.tanh(X @ self.W1 + self.b1)
        return (H @ self.W2 + self.b2).ravel()

rng = np.random.default_rng(37)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] ** 2
model = TinyMLP(n_features=2).fit(X, y)
print(np.round(model.predict(X[:3]), 3))
# [ 0.85  -0.125 -0.752]

Real projects use automatic differentiation, batching, regularisation, and better optimisers such as Adam. The point here is to see what those tools automate.
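
One way to gain confidence in a hand-written backward pass is a finite-difference gradient check. The sketch below is not part of the lesson's class: `loss_and_grads` and `numerical_grad` are hypothetical helper names, and the data, shapes, and seed are made up. It recomputes the same tanh-network gradients and compares them with central differences.

```python
import numpy as np

def loss_and_grads(params, X, y):
    """MSE loss and hand-derived gradients for a one-hidden-layer tanh network."""
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    y_hat = H @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)
    d_y_hat = 2 * (y_hat - y) / len(y)
    d_Z1 = (d_y_hat @ W2.T) * (1 - H**2)
    # Gradients in the same order as params: W1, b1, W2, b2.
    grads = [X.T @ d_Z1, d_Z1.sum(axis=0), H.T @ d_y_hat, d_y_hat.sum(axis=0)]
    return loss, grads

def numerical_grad(params, X, y, eps=1e-6):
    """Central-difference estimate of the gradients, one entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus, _ = loss_and_grads(params, X, y)
            p[idx] = old - eps
            loss_minus, _ = loss_and_grads(params, X, y)
            p[idx] = old  # restore the original value
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)  # arbitrary seed
X = rng.normal(size=(20, 2))
y = rng.normal(size=(20, 1))
params = [rng.normal(scale=0.3, size=(2, 4)), np.zeros(4),
          rng.normal(scale=0.3, size=(4, 1)), np.zeros(1)]

_, analytic = loss_and_grads(params, X, y)
numeric = numerical_grad(params, X, y)
max_err = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"largest absolute difference: {max_err:.2e}")
```

A difference many orders of magnitude smaller than the gradients themselves suggests the chain-rule algebra is right. Automatic differentiation removes the need to derive these formulas by hand.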

Key properties and trade-offs

Property	Meaning	Finance consequence
Compositional	Layers build nonlinear features.	Useful for nonlinear surfaces and interaction-heavy signals.
Differentiable	Training depends on gradients through the graph.	Smooth losses and activations make optimisation feasible.
Data-hungry	Flexibility can overfit small datasets.	Chronological validation is essential.
Harder to interpret	Parameters are not factor betas.	Benchmark against simpler models.
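
The chronological validation mentioned in the table can be sketched as a split that never shuffles rows. The helper name `chronological_split`, its `train_frac` parameter, and the random data are assumptions for illustration. The rows are assumed to be sorted by date already.

```python
import numpy as np

def chronological_split(X, y, train_frac=0.75):
    """Split time-ordered rows into an earlier training block and a later validation block."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must lie strictly between 0 and 1")
    cut = int(len(X) * train_frac)
    return X[:cut], y[:cut], X[cut:], y[cut:]

# Hypothetical data standing in for date-sorted features and targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

X_train, y_train, X_val, y_val = chronological_split(X, y)
print(X_train.shape, X_val.shape)  # (75, 3) (25, 3)
```

Every validation row comes after every training row, so the model is never scored on periods that precede data it was trained on.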

Common confusions and pitfalls

"A neural network is always better." On structured finance data, Random Forest or regularised linear models may win.

"Backpropagation is mysterious." It is the chain rule applied efficiently to a computation graph.

"Training loss is model quality." A network can learn noise while training loss falls.
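
A simple guard is to record validation loss alongside training loss. The sketch below is an illustration under made-up assumptions: `train_and_track` is a hypothetical helper, and the synthetic data, the noise level, the 30/30 split, and the hyperparameters are arbitrary. It uses the same full-batch updates as the lesson's network.

```python
import numpy as np

def mse(X, y, W1, b1, W2, b2):
    """Mean squared error of the tanh network on (X, y)."""
    return float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2))

def train_and_track(X_tr, y_tr, X_va, y_va, n_hidden=16, lr=0.02, n_iter=3000, seed=0):
    """Full-batch gradient descent that records training and validation MSE."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.3, size=(X_tr.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.3, size=(n_hidden, 1))
    b2 = np.zeros(1)
    y_tr, y_va = y_tr.reshape(-1, 1), y_va.reshape(-1, 1)
    history = []
    for step in range(n_iter):
        # Forward and backward pass on the training block only.
        H = np.tanh(X_tr @ W1 + b1)
        d_y_hat = 2 * (H @ W2 + b2 - y_tr) / len(y_tr)
        d_Z1 = (d_y_hat @ W2.T) * (1 - H**2)
        W2 -= lr * (H.T @ d_y_hat)
        b2 -= lr * d_y_hat.sum(axis=0)
        W1 -= lr * (X_tr.T @ d_Z1)
        b1 -= lr * d_Z1.sum(axis=0)
        # Record both losses every 500 steps and at the final step.
        if step % 500 == 0 or step == n_iter - 1:
            history.append((step, mse(X_tr, y_tr, W1, b1, W2, b2),
                            mse(X_va, y_va, W1, b1, W2, b2)))
    return history

# Synthetic example: a weak signal buried in noise, with few training rows.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(60, 2))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=60)

history = train_and_track(X[:30], y[:30], X[30:], y[30:])
for step, train_mse, val_mse in history:
    print(f"step {step:4d}  train MSE {train_mse:.3f}  validation MSE {val_mse:.3f}")
```

The exact numbers depend on the seed and settings. The pattern to watch for is training MSE that keeps falling while validation MSE stalls or rises, which signals that the network is fitting noise.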

Where this goes next

Gradient Descent: supplies the optimisation loop.
Adam: improves noisy gradient updates with adaptive moments.
Regularisation: L1 vs L2: introduces weight penalties.
Cross-Validation: keeps model selection outside the training loop.

References

Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 10 (Introduction to Artificial Neural Networks with Keras).
Andrew Ng and Tengyu Ma (2023). CS229 Lecture Notes. Ch. 7 (Deep Learning and Backpropagation).
François Chollet (2021). Deep Learning with Python (2nd ed.). Manning. Ch. 2 (The Mathematical Building Blocks of Neural Networks).