Random Forest

Motivation: why this matters in quant finance

Random forests are strong tabular-data baselines for finance. They handle nonlinear thresholds, mixed scales, feature interactions, and noisy predictors with little manual feature engineering. That makes them useful for credit scoring, execution-quality prediction, volatility-regime labelling, and robust research baselines.

A forest is not a single interpretable rule list. It is a variance-reduction machine built from many Decision Tree models trained on different bootstrap samples and feature subsets.

The informal idea

Train many trees. Each tree sees a bootstrap sample of the observations (drawn with replacement) and usually a random subset of features at each split. For classification, let trees vote. For regression, average their predictions.
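
The recipe above can be sketched by hand with scikit-learn's `DecisionTreeClassifier` as the base learner. This is a minimal illustration, not how a production forest is implemented: the data, the labelling rule, and the names `n_trees`, `forest_vote`, and `holdout_accuracy` are all made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic, purely illustrative data: the label follows a linear rule in two of four features.
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_new = rng.normal(size=(200, 4))
y_new = (X_new[:, 0] + 0.5 * X_new[:, 1] > 0).astype(int)

n_trees = 51  # odd, so a 0/1 majority vote cannot tie
trees = []
for b in range(n_trees):
    # Bootstrap: draw n row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes every split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    trees.append(tree.fit(X[idx], y[idx]))

def forest_vote(X_query):
    # Stack the 0/1 votes, shape (n_trees, n_samples), and take the majority.
    votes = np.stack([t.predict(X_query) for t in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)

holdout_accuracy = (forest_vote(X_new) == y_new).mean()
print(holdout_accuracy)
```

`RandomForestClassifier` packages the same two sources of randomness (row resampling and per-split feature subsampling) behind one estimator.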

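The forest works when trees are individually useful and not perfectly correlated. Adding trees averages away the Monte Carlo noise that comes from random resampling, but it cannot push ensemble variance below the floor set by the correlation between trees. Feature subsampling and bootstrapping lower that correlation.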
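
The variance argument can be written out under a simplifying assumption that goes beyond the text above: suppose every tree's prediction at a point has the same variance $\sigma^2$ and every pair of trees has the same correlation $\rho$. Then the average of $B$ trees has variance

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^B T_b(\mathbf{x})\right)=\rho\,\sigma^2+\frac{1-\rho}{B}\,\sigma^2.$$

The second term shrinks as $B$ grows, while the first term does not depend on $B$ at all. Under this idealization, only lower correlation between trees reduces the remaining variance.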

Formal statement

For regression trees $T_b(\mathbf{x})$, $b=1,\ldots,B$, the forest prediction is

$$\hat{f}(\mathbf{x})=\frac{1}{B}\sum_{b=1}^B T_b(\mathbf{x}).$$

For classification, average class probabilities or take a majority vote.
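
As a quick check of the regression formula, the sketch below averages the individual tree predictions by hand and compares the result with the forest's own output. The data are synthetic and the names `manual_average` and `matches` are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=25, random_state=1).fit(X, y)

# (1/B) * sum_b T_b(x), computed from the fitted trees in forest.estimators_.
manual_average = np.mean([tree.predict(X) for tree in forest.estimators_], axis=0)
matches = np.allclose(manual_average, forest.predict(X))
print(matches)
```

For classification, scikit-learn's `RandomForestClassifier` takes the probability-averaging route: `predict_proba` averages the per-tree class probabilities and `predict` returns the class with the highest average.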

Implementation

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic features: standardized return, lognormal volatility, standardized spread.
rng = np.random.default_rng(29)
ret = rng.normal(0, 1, size=600)
vol = rng.lognormal(mean=0.0, sigma=0.35, size=600)
spread = rng.normal(0, 1, size=600)
# Label "stress" observations with a linear threshold rule on the features.
stress = (-0.8 * ret + 1.1 * vol + 0.5 * spread > 1.2).astype(int)
X = np.c_[ret, vol, spread]
X_train, X_test, y_train, y_test = train_test_split(
    X, stress, test_size=0.3, random_state=29, stratify=stress
)
forest = RandomForestClassifier(
    n_estimators=300, max_features="sqrt", min_samples_leaf=5,
    oob_score=True, random_state=29,
)
forest.fit(X_train, y_train)
# Out-of-bag accuracy first, then accuracy on the held-out 30%.
print(round(forest.oob_score_, 3), round(forest.score(X_test, y_test), 3))
# 0.933 0.928

Out-of-bag (OOB) evaluation is a useful internal check: each tree's bootstrap sample omits some training observations, and those observations can be scored using only the trees that never saw them. It is not a replacement for a final holdout.
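
How many observations does a tree leave out? When $n$ rows are drawn with replacement from $n$, a given row is missed with probability $(1-1/n)^n$, which approaches $e^{-1}\approx 0.368$. The short simulation below checks this with NumPy alone; the sample size and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# One bootstrap sample: n draws with replacement from n row indices.
in_bag = rng.integers(0, n, size=n)

# Rows that never appear in the sample are out-of-bag for this tree.
oob_fraction = 1.0 - np.unique(in_bag).size / n
print(round(oob_fraction, 3), round(float(np.exp(-1)), 3))
```

So each tree has roughly a third of the training set available as unseen data, which is what makes the OOB estimate possible without a separate validation split.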

Key properties and trade-offs

Property	Meaning	Finance consequence
Bagging	Trees train on bootstrap samples.	Reduces single-tree variance.
Feature subsampling	Splits see only some predictors.	Decorrelates trees and limits dominant features.
OOB evaluation	Left-out bootstrap observations estimate error.	Useful during research, not final reporting.
Limited extrapolation	Predictions average leaf outcomes.	Poor outside historical support.
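
The extrapolation row can be illustrated with a small sketch. It assumes a made-up linear relationship observed only on a narrow range, then queries the forest far outside that range. Because each tree predicts an average of training targets, the forest's output cannot exceed the largest target it was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Hypothetical linear relationship, observed only for x in [0, 1].
x_train = rng.uniform(0.0, 1.0, size=500)
y_train = 2.0 * x_train + 0.05 * rng.normal(size=500)

forest = RandomForestRegressor(n_estimators=100, random_state=3)
forest.fit(x_train.reshape(-1, 1), y_train)

# Query far outside the training range; the underlying line would give 2 * 5 = 10.
far_prediction = forest.predict(np.array([[5.0]]))[0]
print(far_prediction <= y_train.max())
```

In a finance setting the analogue is a volatility or spread level never seen in the training history: the forest will answer as if conditions were at the edge of what it has seen.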

Worked example: execution-quality classifier

A forest can combine spread, order size, volatility, queue imbalance, and recent trade direction to classify high-slippage orders. A linear model needs explicit interactions for many of these effects. A forest can learn that large orders are problematic mainly when spread and volatility are already high.
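
A toy version of this comparison is sketched below. Everything in it is an assumption made for illustration: the three standardized features, the rule that slippage is high only when order size, spread, and volatility are all above average, and the variable names. Real order data would be noisier and would need a chronological split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 3000

# Hypothetical standardized features.
order_size = rng.normal(size=n)
spread = rng.normal(size=n)
volatility = rng.normal(size=n)

# Assumed labelling rule: a pure three-way interaction.
high_slippage = ((order_size > 0) & (spread > 0) & (volatility > 0)).astype(int)

X = np.c_[order_size, spread, volatility]
X_train, X_test, y_train, y_test = train_test_split(
    X, high_slippage, test_size=0.3, random_state=4, stratify=high_slippage
)

# Linear model on the raw features, with no hand-built interaction terms.
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
forest_acc = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=5, random_state=4
).fit(X_train, y_train).score(X_test, y_test)
print(round(linear_acc, 3), round(forest_acc, 3))
```

The linear model can only draw one plane through feature space, so it misses part of the corner region where all three conditions hold. The forest recovers that region from axis-aligned splits without being told the interaction exists.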

Common confusions and pitfalls

"More trees always fix overfitting." More trees reduce ensemble variance. They do not fix leakage, bad labels, or tiny leaves.

"Out-of-bag is the test set." OOB is diagnostic. Keep an untouched chronological test period.

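One way to keep a chronological test period untouched is sketched below, assuming the rows are already sorted from oldest to newest. The 80/20 cutoff, the fold count, and the variable names are illustrative choices; `TimeSeriesSplit` is scikit-learn's forward-chaining splitter.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical setup: 500 observations sorted from oldest to newest.
n_obs = 500
row_ids = np.arange(n_obs)

# Reserve the most recent 20% as the final test period; never tune on it.
cutoff = int(n_obs * 0.8)
research_ids, final_test_ids = row_ids[:cutoff], row_ids[cutoff:]

# Inside the research window, tune with forward-chaining folds.
splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(research_ids.reshape(-1, 1)))
for train_idx, valid_idx in folds:
    # Every validation row comes strictly after every training row.
    print(train_idx.max(), "<", valid_idx.min())
```

OOB scores and the fold scores both belong to the research window; the final period is scored once, after all modelling choices are fixed.
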
"Feature importance is stable." Correlated predictors can split importance between themselves.

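The importance-splitting effect can be seen in a small synthetic sketch. It fits one forest on a signal plus a noise feature, then a second forest after adding a near-duplicate of the signal. The data, the helper `importances`, and the 0.01 noise scale are assumptions chosen for illustration; `feature_importances_` is scikit-learn's impurity-based measure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n = 2000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
# Near-duplicate of the signal, standing in for a highly correlated predictor.
signal_copy = signal + 0.01 * rng.normal(size=n)
y = (signal > 0).astype(int)

def importances(X):
    # Fit a forest on X and return its impurity-based importances.
    forest = RandomForestClassifier(n_estimators=200, random_state=5).fit(X, y)
    return forest.feature_importances_

alone = importances(np.c_[signal, noise])
with_copy = importances(np.c_[signal, signal_copy, noise])
print(np.round(alone, 2), np.round(with_copy, 2))
```

The signal's measured importance drops once its copy is present, even though its relationship to the label has not changed. Reading importances per group of correlated features, rather than per column, avoids misjudging such predictors.
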
Where this goes next

Decision Tree: supplies the base learner.
Cross-Validation: tunes depth, leaf size, and feature subsampling.
K-Nearest Neighbors (KNN): provides a contrasting nonparametric baseline.
Neural Networks from Scratch: shifts to differentiable function approximation.

References

Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 7 (Ensemble Learning and Random Forests).
Avrim Blum, John Hopcroft, and Ravindran Kannan (2020). Foundations of Data Science. Cambridge University Press. Ch. 5 (Overfitting and generalization themes).