Random Forest
Motivation: why this matters in quant finance
Random forests are strong tabular-data baselines for finance. They handle nonlinear thresholds, mixed scales, feature interactions, and noisy predictors with little manual feature engineering. That makes them useful for credit scoring, execution-quality prediction, and volatility-regime labelling, and as a robust benchmark against which to judge more elaborate models.
The informal idea
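Train many trees. Each tree is fit on a bootstrap sample of the observations (drawn with replacement), and each split usually considers only a random subset of the features. For classification, let trees vote. For regression, average their predictions.
The forest works when trees are individually useful and not perfectly correlated. Adding trees reduces the Monte Carlo noise of the ensemble average but not its bias; bootstrapping and feature subsampling reduce the correlation between trees. For $B$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + (1-\rho)\sigma^2/B$, so correlation sets a floor that more trees cannot remove.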
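The recipe above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not how `RandomForestClassifier` is implemented internally; the toy data, the label rule, and the choice of 101 trees (odd, to avoid tied votes) are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 observations, 4 features, a threshold interaction as label.
X = rng.normal(size=(500, 4))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)
X_train, y_train = X[:350], y[:350]
X_test, y_test = X[350:], y[350:]

n_trees = 101  # odd, so a binary majority vote cannot tie
trees = []
for b in range(n_trees):
    # Bootstrap: draw as many rows as the training set has, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes every split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote: with 0/1 labels, a mean vote above 0.5 is a majority for class 1.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean([np.mean(t.predict(X_test) == y_test) for t in trees])
forest_acc = np.mean(forest_pred == y_test)
print(f"mean single-tree accuracy: {single_acc:.3f}")
print(f"voted forest accuracy:     {forest_acc:.3f}")
```

Comparing the two printed accuracies shows how much the vote adds over a typical individual tree on this particular data set; the gap will differ on other data.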
Formal statement
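For regression trees $T_1, \dots, T_B$, the forest prediction is the average $\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$.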
For classification, average class probabilities or take a majority vote.
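The averaging rule can be checked directly against scikit-learn, whose forests expose their fitted trees through the `estimators_` attribute. The sketch below uses made-up data and assumes scikit-learn's default behaviour, in which the classifier averages per-tree class probabilities (a soft vote) rather than counting hard votes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)

# Hypothetical data: a nonlinear regression target and a binary label derived from it.
X = rng.normal(size=(400, 3))
y_reg = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=400)
y_clf = (y_reg > np.median(y_reg)).astype(int)

# Regression: the forest prediction is the mean of the per-tree predictions.
reg = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y_reg)
tree_preds = np.stack([t.predict(X) for t in reg.estimators_])  # (n_trees, n_obs)
manual_reg = tree_preds.mean(axis=0)
reg_matches = np.allclose(manual_reg, reg.predict(X))

# Classification: the forest probability is the mean of the per-tree probabilities.
clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y_clf)
tree_probs = np.stack([t.predict_proba(X) for t in clf.estimators_])  # (n_trees, n_obs, 2)
manual_proba = tree_probs.mean(axis=0)
proba_matches = np.allclose(manual_proba, clf.predict_proba(X))

print(reg_matches, proba_matches)
```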
Implementation
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic features: return, volatility, and spread.
rng = np.random.default_rng(29)
ret = rng.normal(0, 1, size=600)
vol = rng.lognormal(mean=0.0, sigma=0.35, size=600)
spread = rng.normal(0, 1, size=600)

# Label an observation as "stress" when a linear score crosses a threshold.
stress = (-0.8 * ret + 1.1 * vol + 0.5 * spread > 1.2).astype(int)
X = np.c_[ret, vol, spread]

X_train, X_test, y_train, y_test = train_test_split(
    X, stress, test_size=0.3, random_state=29, stratify=stress
)
forest = RandomForestClassifier(
    n_estimators=300, max_features="sqrt", min_samples_leaf=5,
    oob_score=True, random_state=29,
)
forest.fit(X_train, y_train)

# Out-of-bag accuracy, then holdout accuracy.
print(round(forest.oob_score_, 3), round(forest.score(X_test, y_test), 3))
# 0.933 0.928
```

Out-of-bag (OOB) evaluation is a useful internal check: each tree's bootstrap sample omits roughly 37% of the training observations, and each observation is scored only by the trees that never saw it. It is not a replacement for a final holdout.
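To see why each tree has left-out observations at all, the short sketch below draws one bootstrap sample and measures the share of rows that were never selected. The sample size is an arbitrary choice for illustration; the comparison value $e^{-1} \approx 0.368$ is the large-sample limit of $(1 - 1/n)^n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000  # arbitrary illustrative sample size

# One bootstrap sample: n row indices drawn with replacement.
idx = rng.integers(0, n, size=n)

# Rows that never appear in the sample are "out of bag" for this tree.
in_bag_fraction = np.unique(idx).size / n
oob_fraction = 1.0 - in_bag_fraction

print(round(oob_fraction, 3), round(float(np.exp(-1)), 3))
```

Because every tree draws its own sample, almost every observation is out of bag for some trees once the forest has a few dozen of them, which is what makes the OOB score available without a separate validation split.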
Key properties and trade-offs
| Property | Meaning | Finance consequence |
|---|---|---|
| Bagging | Trees train on bootstrap samples. | Reduces single-tree variance. |
| Feature subsampling | Splits see only some predictors. | Decorrelates trees and limits dominant features. |
| OOB evaluation | Left-out bootstrap observations estimate error. | Useful during research, not final reporting. |
| Limited extrapolation | Predictions average leaf outcomes. | Poor outside historical support. |
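
The limited-extrapolation row can be illustrated with a small experiment. The sketch below fits a forest to a made-up linear trend on $x \in [0, 10)$ and then queries points far outside that range; the data-generating rule and all parameter values are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Hypothetical training data: a noisy linear trend observed only on [0, 10).
x_train = rng.uniform(0.0, 10.0, size=500)
y_train = 2.0 * x_train + rng.normal(0, 0.5, size=500)

forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=5, random_state=3)
forest.fit(x_train.reshape(-1, 1), y_train)

# Query one point inside the training range and three at or beyond its edge.
x_new = np.array([[5.0], [10.0], [20.0], [50.0]])
pred = forest.predict(x_new)
for x, p in zip(x_new.ravel(), pred):
    print(f"x={x:5.1f}  true trend={2 * x:6.1f}  forest={p:6.2f}")
```

Every prediction is an average of training targets, so it can never exceed the largest target seen in training, and all points beyond the right edge of the data fall into the same leaves and receive the same value. In a finance setting this is the situation where a regime lies outside the historical sample.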
Worked example: execution-quality classifier
A forest can combine spread, order size, volatility, queue imbalance, and recent trade direction to classify high-slippage orders. A linear model needs explicit interactions for many of these effects. A forest can learn that large orders are problematic mainly when spread and volatility are already high.
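A synthetic version of this example might look like the sketch below. Nothing here is real order data: the five features are independent random draws, the high-slippage rule is invented to encode the interaction described above, and `queue_imbalance` and `trade_direction` are deliberately left as pure noise. The logistic regression is given no interaction terms, so the comparison shows what the linear model misses rather than what it could achieve with feature engineering.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 4000

# Hypothetical standardized order features (assumed, not drawn from real data).
spread = rng.normal(size=n)
order_size = rng.normal(size=n)
volatility = rng.normal(size=n)
queue_imbalance = rng.normal(size=n)               # irrelevant noise feature
trade_direction = rng.choice([-1.0, 1.0], size=n)  # irrelevant noise feature

# Invented rule: large orders slip only when spread AND volatility are both high.
high_slippage = ((order_size > 0) & (spread > 0) & (volatility > 0)).astype(int)

X = np.c_[spread, order_size, volatility, queue_imbalance, trade_direction]
X_train, X_test, y_train, y_test = train_test_split(
    X, high_slippage, test_size=0.3, random_state=4, stratify=high_slippage
)

# Linear baseline without hand-built interaction terms.
logit = LogisticRegression().fit(X_train, y_train)
forest = RandomForestClassifier(
    n_estimators=300, min_samples_leaf=5, random_state=4
).fit(X_train, y_train)

logit_acc = logit.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"logistic accuracy: {logit_acc:.3f}")
print(f"forest accuracy:   {forest_acc:.3f}")
```

Only about one order in eight is labelled high-slippage under this rule, so plain accuracy flatters any model that mostly predicts the majority class; on real data a class-aware metric would be a more informative comparison.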
Common confusions and pitfalls
Where this goes next
- Decision Tree: supplies the base learner.
- Cross-Validation: tunes depth, leaf size, and feature subsampling.
- K-Nearest Neighbors (KNN): provides a contrasting nonparametric baseline.
- Neural Networks from Scratch: shifts to differentiable function approximation.
References
- Aurelien Geron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 7 (Ensemble Learning and Random Forests).
- Avrim Blum, John Hopcroft, and Ravindran Kannan (2020). Foundations of Data Science. Ch. 5 (Overfitting and generalization themes).