Neural Networks from Scratch
Motivation: why this matters in quant finance
Neural networks are differentiable function approximators. In quant finance they appear in volatility-surface smoothing, nonlinear factor models, surrogate pricing functions, execution models, and regime classifiers. Calling a library is easy; what pays off is understanding how affine maps, activations, losses, gradients, and updates fit together.
This lesson is intentionally from scratch. It does not replace Keras or PyTorch. It gives the NumPy-level mechanics so later library code feels inspectable rather than magical.
The informal idea
A dense neural network alternates linear transformations with nonlinear activations. The first layer builds hidden features. The output layer turns hidden features into predictions. Training repeats four steps: forward pass, loss computation, backward pass, parameter update.
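The role of the nonlinear activation can be checked numerically. The sketch below is only an illustration: the shapes (3 inputs, 4 hidden units, 1 output), the seed, and the variable names are assumptions, not part of the lesson's model. It shows that two stacked affine layers without an activation collapse into a single affine map, and that inserting `tanh` between them breaks the collapse.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, chosen for this illustration
X = rng.normal(size=(5, 3))      # 5 samples, 3 features

# Two affine layers with hypothetical shapes 3 -> 4 -> 1.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)

# Without an activation, the two layers compose into one affine map:
# (X W1 + b1) W2 + b2 = X (W1 W2) + (b1 W2 + b2).
stacked = (X @ W1 + b1) @ W2 + b2
W_eq, b_eq = W1 @ W2, b1 @ W2 + b2
collapsed = X @ W_eq + b_eq
print(np.allclose(stacked, collapsed))    # True

# With tanh in between, that single affine map no longer reproduces the output.
nonlinear = np.tanh(X @ W1 + b1) @ W2 + b2
print(np.allclose(nonlinear, collapsed))  # False
```

Depth adds expressive power only because of the activation; without it, any number of stacked layers is still a linear model.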
Formal statement
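For a one-hidden-layer regression network with input matrix $X \in \mathbb{R}^{n \times d}$ and $m$ hidden units,

$$
Z_1 = X W_1 + b_1, \qquad H = \tanh(Z_1), \qquad \hat{y} = H W_2 + b_2,
$$

where $W_1 \in \mathbb{R}^{d \times m}$, $b_1 \in \mathbb{R}^{m}$, $W_2 \in \mathbb{R}^{m \times 1}$, $b_2 \in \mathbb{R}$, and each bias is added to every row.

With mean squared error,

$$
\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 .
$$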
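Applying the chain rule to the mean squared error $\mathcal{L} = \frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$ gives the gradients that the backward pass in the implementation computes. This is a sketch in matrix form. It assumes the batch notation $H = \tanh(X W_1 + b_1)$, $\hat{y} = H W_2 + b_2$ with $X \in \mathbb{R}^{n \times d}$. The symbols $\delta$ (the output error signal), $Z_1 = X W_1 + b_1$, $\mathbf{1}$ (the all-ones vector of length $n$), and $\odot$ (the elementwise product) are introduced here for illustration.

```latex
\begin{aligned}
\delta &= \frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{2}{n}\,(\hat{y} - y), \\
\frac{\partial \mathcal{L}}{\partial W_2} &= H^\top \delta, &
\frac{\partial \mathcal{L}}{\partial b_2} &= \mathbf{1}^\top \delta, \\
\frac{\partial \mathcal{L}}{\partial Z_1} &= \left(\delta\, W_2^\top\right) \odot \left(1 - H \odot H\right), \\
\frac{\partial \mathcal{L}}{\partial W_1} &= X^\top \frac{\partial \mathcal{L}}{\partial Z_1}, &
\frac{\partial \mathcal{L}}{\partial b_1} &= \mathbf{1}^\top \frac{\partial \mathcal{L}}{\partial Z_1}.
\end{aligned}
```

The factor $1 - H \odot H$ is the derivative of $\tanh$ evaluated at $Z_1$. The products with $\mathbf{1}^\top$ are column sums over the batch, which is what `sum(axis=0)` computes in NumPy.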
Implementation
```python
import numpy as np


class TinyMLP:
    """One-hidden-layer neural network for regression."""

    def __init__(self, n_features: int, n_hidden: int = 8, learning_rate: float = 0.05, seed: int = 37):
        rng = np.random.default_rng(seed)
        # Small random weights break the symmetry between hidden units; biases start at zero.
        self.W1 = rng.normal(scale=0.3, size=(n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.3, size=(n_hidden, 1))
        self.b2 = np.zeros(1)
        self.learning_rate = learning_rate

    def fit(self, X: np.ndarray, y: np.ndarray, n_iter: int = 2_000):
        y = y.reshape(-1, 1)  # column vector, so y_hat - y stays (n, 1)
        for _ in range(n_iter):
            # Forward pass
            H = np.tanh(X @ self.W1 + self.b1)
            y_hat = H @ self.W2 + self.b2

            # Backward pass: gradients of the mean squared error
            d_y_hat = 2 * (y_hat - y) / len(y)
            d_W2 = H.T @ d_y_hat
            d_b2 = d_y_hat.sum(axis=0)
            d_H = d_y_hat @ self.W2.T
            d_Z1 = d_H * (1 - H**2)  # tanh'(z) = 1 - tanh(z)^2

            # Gradient-descent update
            self.W2 -= self.learning_rate * d_W2
            self.b2 -= self.learning_rate * d_b2
            self.W1 -= self.learning_rate * (X.T @ d_Z1)
            self.b1 -= self.learning_rate * d_Z1.sum(axis=0)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        H = np.tanh(X @ self.W1 + self.b1)
        return (H @ self.W2 + self.b2).ravel()


rng = np.random.default_rng(37)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] ** 2

model = TinyMLP(n_features=2).fit(X, y)
print(np.round(model.predict(X[:3]), 3))
# [ 0.85 -0.125 -0.752]
```

Key properties and trade-offs
| Property | Meaning | Finance consequence |
|---|---|---|
| Compositional | Layers build nonlinear features. | Useful for nonlinear surfaces and interaction-heavy signals. |
| Differentiable | Training depends on gradients through the graph. | Smooth losses and activations make optimisation feasible. |
| Data-hungry | Flexibility can overfit small datasets. | Chronological validation is essential. |
| Harder to interpret | Parameters are not factor betas. | Benchmark against simpler models. |
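The last two rows of the table (chronological validation, benchmarking against simpler models) can be sketched together. Everything below is an illustrative assumption rather than the lesson's own procedure: the helper `fit_tanh_mlp` is a hypothetical compact version of the network above, the data are synthetic, the rows are merely treated as time-ordered, and the 80/20 split point is arbitrary. The linear benchmark uses `np.linalg.lstsq`.

```python
import numpy as np


def fit_tanh_mlp(X, y, n_hidden=8, lr=0.05, n_iter=2_000, seed=37):
    """Hypothetical helper: one-hidden-layer tanh regressor, full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.3, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.3, size=(n_hidden, 1))
    b2 = np.zeros(1)
    y = y.reshape(-1, 1)
    for _ in range(n_iter):
        H = np.tanh(X @ W1 + b1)
        d_out = 2 * (H @ W2 + b2 - y) / len(y)
        d_Z1 = (d_out @ W2.T) * (1 - H**2)  # uses W2 before it is updated
        W2 -= lr * (H.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_Z1)
        b1 -= lr * d_Z1.sum(axis=0)
    return lambda X_new: (np.tanh(X_new @ W1 + b1) @ W2 + b2).ravel()


# Synthetic sample; row i is assumed to be observed before row i + 1.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# Chronological split: fit on the first 80%, validate on the last 20%, no shuffling.
split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Simpler benchmark: ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
A_val = np.column_stack([np.ones(len(X_val)), X_val])
beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
mse_linear = np.mean((A_val @ beta - y_val) ** 2)

# Network fitted on the same training window, scored on the same validation window.
predict = fit_tanh_mlp(X_train, y_train)
mse_mlp = np.mean((predict(X_val) - y_val) ** 2)

print(f"validation MSE  linear: {mse_linear:.4f}  MLP: {mse_mlp:.4f}")
```

Because the synthetic rows are independent draws, the time ordering here is only nominal. With real market data the same slicing keeps future observations out of the training window, and the linear score gives a floor the network has to beat to justify its extra complexity.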
Common confusions and pitfalls
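One pitfall that commonly arises with hand-written backpropagation is a silently wrong gradient, for example a missing transpose, a forgotten `1 / n`, or a target left with shape `(n,)` so that `y_hat - y` broadcasts to `(n, n)`. A finite-difference gradient check is a standard guard. The sketch below is an assumption-laden illustration: `loss_and_grads` is a hypothetical helper that mirrors the forward and backward pass of a one-hidden-layer tanh network, and the step `eps`, the tolerance, and the data sizes are arbitrary choices.

```python
import numpy as np


def loss_and_grads(params, X, y):
    """MSE loss and analytic gradients for a one-hidden-layer tanh network."""
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    y_hat = H @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)
    d_y_hat = 2 * (y_hat - y) / len(y)
    d_Z1 = (d_y_hat @ W2.T) * (1 - H**2)
    grads = [X.T @ d_Z1, d_Z1.sum(axis=0), H.T @ d_y_hat, d_y_hat.sum(axis=0)]
    return loss, grads


rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(20, 2))
y = (np.sin(X[:, 0]) + 0.3 * X[:, 1] ** 2).reshape(-1, 1)  # explicit column vector
params = [rng.normal(scale=0.3, size=(2, 4)), np.zeros(4),
          rng.normal(scale=0.3, size=(4, 1)), np.zeros(1)]

_, grads = loss_and_grads(params, X, y)

# Central differences: perturb one parameter at a time and compare slopes.
eps = 1e-6
max_err = 0.0
for p, g in zip(params, grads):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        loss_plus, _ = loss_and_grads(params, X, y)
        p[idx] = old - eps
        loss_minus, _ = loss_and_grads(params, X, y)
        p[idx] = old  # restore the original value
        numeric = (loss_plus - loss_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - g[idx]))

print(max_err < 1e-6)  # True
```

If the analytic and numerical gradients disagree by more than a small tolerance, the backward pass is the first place to look, before tuning the learning rate or the architecture.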
Where this goes next
- Gradient Descent: supplies the optimisation loop.
- Adam: improves noisy gradient updates with adaptive moments.
- Regularisation: L1 vs L2: introduces weight penalties.
- Cross-Validation: keeps model selection outside the training loop.
References
- Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 10 (Introduction to Artificial Neural Networks with Keras).
- Andrew Ng and Tengyu Ma (2023). CS229 Lecture Notes. Ch. 7 (Deep Learning and Backpropagation).
- François Chollet (2021). Deep Learning with Python (2nd ed.). Manning. Ch. 2 (The Mathematical Building Blocks of Neural Networks).