Random Forest

Motivation: why this matters in quant finance

Random forests are strong baselines for tabular data in finance. They handle nonlinear thresholds, mixed scales, feature interactions, and noisy predictors with little manual feature engineering. That makes them useful for credit scoring, execution-quality prediction, and volatility-regime labelling, and as robust baselines in research.

A forest is not a single interpretable rule list. It is a variance-reduction machine built from many decision trees, each trained on a different bootstrap sample and restricted to a random subset of features at each split.

The informal idea

Train many trees. Each tree sees a bootstrap sample of the observations (drawn with replacement) and usually considers only a random subset of features at each split. For classification, let the trees vote. For regression, average their predictions.

The forest works when trees are individually useful and not perfectly correlated. More trees reduce the Monte Carlo noise of the ensemble average; feature subsampling and bootstrapping reduce the correlation between trees, which limits how far averaging can lower variance.
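A short calculation makes the role of correlation explicit. It rests on a simplifying assumption that is not part of the text above: at a fixed point, the B tree predictions T_b(x) are identically distributed with common variance sigma^2 and common pairwise correlation rho. Under that assumption, the variance of their average is:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(\mathbf{x})\right)
  = \frac{1}{B^2}\left[B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 .
```

The second term shrinks as B grows, which is the noise that extra trees remove. The first term does not depend on B, so only a lower correlation between trees can reduce it.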

Formal statement

For regression trees $T_b(\mathbf{x})$, $b = 1, \ldots, B$,

$$\hat{f}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x}).$$

For classification, average class probabilities or take a majority vote.
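The averaging rule can be checked directly against a fitted scikit-learn forest. The sketch below uses made-up data (the feature matrix and targets are illustrative assumptions) and compares a manual average over `estimators_` with the forest's own `predict` and `predict_proba`. Note that scikit-learn's classifier averages per-tree class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # illustrative features, not real market data
y_reg = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
y_clf = (y_reg > np.median(y_reg)).astype(int)

# Regression: the forest prediction is the mean of the per-tree predictions.
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y_reg)
tree_preds = np.stack([tree.predict(X) for tree in reg.estimators_])
manual_reg = tree_preds.mean(axis=0)
reg_matches = np.allclose(manual_reg, reg.predict(X))

# Classification: average the per-tree class probabilities.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_clf)
tree_probs = np.stack([tree.predict_proba(X) for tree in clf.estimators_])
manual_proba = tree_probs.mean(axis=0)
clf_matches = np.allclose(manual_proba, clf.predict_proba(X))

print(reg_matches, clf_matches)
```

Both flags should come out true, since the manual average is the same computation the forest performs internally.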

Implementation

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(29)

# Synthetic features: return, volatility, and spread.
ret = rng.normal(0, 1, size=600)
vol = rng.lognormal(mean=0.0, sigma=0.35, size=600)
spread = rng.normal(0, 1, size=600)

# Stress label: 1 when a linear score of the features exceeds a threshold.
stress = (-0.8 * ret + 1.1 * vol + 0.5 * spread > 1.2).astype(int)
X = np.c_[ret, vol, spread]

X_train, X_test, y_train, y_test = train_test_split(
    X, stress, test_size=0.3, random_state=29, stratify=stress
)

forest = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",
    min_samples_leaf=5,
    oob_score=True,
    random_state=29,
)
forest.fit(X_train, y_train)

# Out-of-bag accuracy, then holdout accuracy.
print(round(forest.oob_score_, 3), round(forest.score(X_test, y_test), 3))
# 0.933 0.928
```

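Out-of-bag (OOB) evaluation is a useful internal check because each tree's bootstrap sample leaves out roughly a third of the training observations, which can then be scored by the trees that never saw them. It is not a replacement for a final holdout.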
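How many observations does a bootstrap sample leave out? A minimal sketch, assuming each tree draws n row indices with replacement from n rows (scikit-learn's default when `bootstrap=True` and `max_samples=None`). The sizes `n` and `n_draws` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000       # illustrative training-set size
n_draws = 200  # illustrative number of bootstrap samples to simulate

oob_fractions = []
for _ in range(n_draws):
    sample = rng.integers(0, n, size=n)  # n row indices drawn with replacement
    in_bag = np.zeros(n, dtype=bool)
    in_bag[sample] = True
    oob_fractions.append(1.0 - in_bag.mean())  # share of rows never drawn

simulated = float(np.mean(oob_fractions))
exact = (1 - 1 / n) ** n  # probability that a given row is never drawn
print(round(simulated, 3), round(exact, 3))
```

The exact value tends to 1/e (about 0.368) as n grows, so every observation is out of bag for a little over a third of the trees on average.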

Key properties and trade-offs

| Property | Meaning | Finance consequence |
|---|---|---|
| Bagging | Trees train on bootstrap samples. | Reduces single-tree variance. |
| Feature subsampling | Splits see only some predictors. | Decorrelates trees and limits dominant features. |
| OOB evaluation | Left-out bootstrap observations estimate error. | Useful during research, not final reporting. |
| Limited extrapolation | Predictions average leaf outcomes. | Poor outside historical support. |
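The limited-extrapolation row can be illustrated with a toy regression. In this sketch (the linear relation, noise level, and query point are all assumptions chosen for illustration) the forest is trained on inputs in [0, 1] and then asked about a point far outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=(500, 1))  # historical support: [0, 1]
y_train = 2.0 * x_train[:, 0] + rng.normal(scale=0.05, size=500)

forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, random_state=0
)
forest.fit(x_train, y_train)

x_out = np.array([[5.0]])  # far outside the training range
pred_out = float(forest.predict(x_out)[0])
true_out = 2.0 * 5.0       # what the assumed linear relation implies

print(round(pred_out, 2), true_out, round(float(y_train.max()), 2))
```

Because every leaf value is an average of training targets, the prediction cannot exceed the largest target seen in training, however far the query point moves.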

Worked example: execution-quality classifier

A forest can combine spread, order size, volatility, queue imbalance, and recent trade direction to classify high-slippage orders. A linear model needs explicit interactions for many of these effects. A forest can learn that large orders are problematic mainly when spread and volatility are already high.
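A small synthetic comparison shows the interaction effect described above. Everything here is assumed for illustration: the feature names, the standardized i.i.d. features, and the label rule that marks an order as high-slippage only when order size, spread, and volatility are all above average. Because the rows are i.i.d. by construction, a random split is acceptable here; real order data would need a chronological split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 3000
# Hypothetical standardized features; the names are illustrative only.
order_size = rng.normal(size=n)
spread = rng.normal(size=n)
volatility = rng.normal(size=n)
queue_imbalance = rng.normal(size=n)  # deliberately irrelevant to the label

# Assumed label rule: large orders hurt only when spread AND volatility are high.
high_slippage = ((order_size > 0) & (spread > 0) & (volatility > 0)).astype(int)

X = np.c_[order_size, spread, volatility, queue_imbalance]
X_train, X_test, y_train, y_test = train_test_split(
    X, high_slippage, test_size=0.3, random_state=7, stratify=high_slippage
)

# Linear baseline without hand-built interaction terms.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(
    n_estimators=200, min_samples_leaf=5, random_state=7
).fit(X_train, y_train)

logit_acc = logit.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(round(logit_acc, 3), round(forest_acc, 3))
```

The logistic model sees only main effects, so it can approximate the three-way condition with a single linear boundary at best. The forest recovers it through successive axis-aligned splits.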

Common confusions and pitfalls

"More trees always fix overfitting." More trees reduce ensemble variance. They do not fix leakage, bad labels, or tiny leaves.
"Out-of-bag is the test set." OOB is diagnostic. Keep an untouched chronological test period.
"Feature importance is stable." Correlated predictors can split importance between themselves.

References

  • Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 7 (Ensemble Learning and Random Forests).
  • Avrim Blum, John Hopcroft, and Ravindran Kannan (2020). Foundations of Data Science. Cambridge University Press. Ch. 5 (Overfitting and generalization themes).