Decision Tree

Motivation: why this matters in quant finance

Decision trees turn nonlinear rules into a model that can still be inspected. A tree can learn that a signal matters only when volatility is high, liquidity is thin, or a spread crosses a threshold. That interaction is awkward for a plain linear model unless it is engineered by hand.

A single tree is also the base learner behind Random Forest. Understanding one tree makes ensembles less mysterious: a forest reduces variance by averaging many unstable trees.

The informal idea

A decision tree asks a sequence of questions. Is realised volatility below this threshold? If yes, go left. Is momentum positive? If no, go right. At a leaf, predict the average target or majority class among observations that reached that leaf.

Each split should make child nodes more homogeneous than the parent. This is local modelling through rectangles in feature space, not through neighbour distances as in K-Nearest Neighbors (KNN).
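
As a minimal sketch of that sequence of questions, a fitted tree can be read as nested conditionals. The function name, the 1.4 volatility threshold, and the leaf values below are all made up for illustration; they are not fitted to any data.

```python
def predict_next_return(realised_vol: float, momentum: float) -> float:
    """Route one observation down a hand-written two-level tree.

    Thresholds and leaf values are hypothetical, chosen only for illustration.
    """
    if realised_vol <= 1.4:       # root: is realised volatility below the threshold?
        if momentum > 0.0:        # left branch: is momentum positive?
            return 0.015          # leaf: calm market, positive momentum
        return 0.0                # leaf: calm market, non-positive momentum
    return -0.02                  # right branch: high-volatility leaf


print(predict_next_return(realised_vol=2.0, momentum=0.5))   # -0.02
print(predict_next_return(realised_vol=1.0, momentum=0.5))   # 0.015
print(predict_next_return(realised_vol=1.0, momentum=-0.3))  # 0.0
```

Each leaf corresponds to one rectangle in (volatility, momentum) space, and momentum is only consulted inside the low-volatility rectangle.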

Formal statement

For classification, Gini impurity is

$$
G_m = 1 - \sum_{k=1}^K p_{m,k}^2,
$$

where $p_{m,k}$ is the fraction of class-$k$ observations in node $m$. CART searches for a feature $j$ and threshold $t$ that minimise the weighted child impurity:

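$$
J(j,t)=\frac{n_L}{n_m}G_L+\frac{n_R}{n_m}G_R,
$$

where $n_m$ is the number of observations in node $m$, and $n_L$, $n_R$, $G_L$, $G_R$ are the observation counts and Gini impurities of the left and right children.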
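
The two formulas can be checked by hand on a tiny node. The sketch below is illustrative only: the helper names `gini` and `split_impurity` and the six-observation dataset are assumptions, and the threshold passed in is assumed to leave both children non-empty.

```python
import numpy as np


def gini(labels: np.ndarray) -> float:
    """Gini impurity 1 - sum_k p_k^2 of one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def split_impurity(x: np.ndarray, labels: np.ndarray, t: float) -> float:
    """Weighted child impurity J for the split x <= t (both children non-empty)."""
    left = x <= t
    n_l, n_r = left.sum(), (~left).sum()
    weighted = n_l * gini(labels[left]) + n_r * gini(labels[~left])
    return float(weighted / len(labels))


# Made-up node: six observations, label 1 = "up day", 0 = "down day".
vol = np.array([0.6, 0.9, 1.1, 1.6, 1.9, 2.2])
up = np.array([1, 1, 1, 0, 0, 1])

print(round(gini(up), 3))                      # 0.444  (parent: 1 - (4/6)^2 - (2/6)^2)
print(round(split_impurity(vol, up, 1.1), 3))  # 0.222  (left child is pure)
print(round(split_impurity(vol, up, 1.6), 3))  # 0.417  (both children mixed)
```

The split at 1.1 wins because it produces one pure child; CART would run this comparison over every feature and candidate threshold.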

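For regression trees, the analogous criterion is the reduction in squared error: each child predicts its mean target, and the split minimises the summed squared error of the two children.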
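
To make the analogy concrete, one common way to write the regression criterion, assuming each child predicts its own mean, is shown here. The symbols $x_{ij}$ (value of feature $j$ for observation $i$) and $\bar{y}_L$, $\bar{y}_R$ (child means) are introduced for this sketch and do not appear above.

$$
J_{\text{reg}}(j,t)=\sum_{i:\,x_{ij}\le t}\left(y_i-\bar{y}_L\right)^2+\sum_{i:\,x_{ij}>t}\left(y_i-\bar{y}_R\right)^2 .
$$

Dividing by $n_m$ gives $\frac{n_L}{n_m}\mathrm{MSE}_L+\frac{n_R}{n_m}\mathrm{MSE}_R$, which has the same weighted form as the Gini criterion and does not change which pair $(j,t)$ is selected.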

Implementation

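import numpy as np

class DecisionStump:
    """One-split regression tree for teaching the split criterion."""
    def fit(self, X: np.ndarray, y: np.ndarray):
        best = (np.inf, None, None)  # (loss, feature index, threshold)
        for j in range(X.shape[1]):
            # Candidate thresholds are the observed values of feature j.
            for threshold in np.unique(X[:, j]):
                left = X[:, j] <= threshold
                if left.all() or (~left).all():
                    continue  # skip splits that leave one child empty
                # Sum of squared errors around each child's mean.
                loss = ((y[left] - y[left].mean()) ** 2).sum()
                loss += ((y[~left] - y[~left].mean()) ** 2).sum()
                if loss < best[0]:
                    best = (loss, j, threshold)
        self.feature_, self.threshold_ = best[1], best[2]
        return self

# Synthetic data: the mean return drops once volatility exceeds 1.4.
rng = np.random.default_rng(23)
vol = rng.uniform(0.5, 2.5, size=80)
y = np.where(vol > 1.4, -0.02, 0.01) + 0.004 * rng.normal(size=80)
stump = DecisionStump().fit(vol.reshape(-1, 1), y)
print(stump.feature_, round(stump.threshold_, 2))
# Prints feature index 0 and a threshold at or just below 1.4. Only observed
# values are tried, so the chosen threshold is expected to be the largest
# sampled vol that does not exceed 1.4.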

Key properties and trade-offs

Property	Meaning	Finance consequence
Axis-aligned splits	Each split thresholds one feature.	Captures simple regimes but needs many splits for diagonal boundaries.
Little need for scaling	Splits depend only on the ordering of each feature's values, so standardisation is not required.	Convenient for mixed tabular features.
High variance	Small data changes can change the tree.	A single tree is often unstable across windows.
Interpretability	Paths can be read as rules.	Useful for research review and governance.
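
The scaling row of the table can be checked with a small experiment. This is a sketch under assumptions: `best_threshold` is a hypothetical one-feature version of the squared-error stump, and the synthetic data and seed are arbitrary. It fits the stump on raw volatility, on its logarithm, and on a rescaled copy, then compares which observations fall in the left child.

```python
import numpy as np


def best_threshold(x: np.ndarray, y: np.ndarray) -> float:
    """Squared-error stump on one feature; returns the best observed threshold."""
    best_loss, best_t = np.inf, None
    for t in np.unique(x)[:-1]:          # skip the max: it would empty the right child
        left = x <= t
        loss = ((y[left] - y[left].mean()) ** 2).sum()
        loss += ((y[~left] - y[~left].mean()) ** 2).sum()
        if loss < best_loss:
            best_loss, best_t = loss, t
    return float(best_t)


rng = np.random.default_rng(0)
vol = rng.uniform(0.5, 2.5, size=200)
y = np.where(vol > 1.4, -0.02, 0.01) + 0.004 * rng.normal(size=200)

t_raw = best_threshold(vol, y)
t_log = best_threshold(np.log(vol), y)      # strictly increasing transform
t_big = best_threshold(1000.0 * vol, y)     # change of units

# The threshold values differ, but the induced partitions are identical.
same_log = np.array_equal(vol <= t_raw, np.log(vol) <= t_log)
same_big = np.array_equal(vol <= t_raw, 1000.0 * vol <= t_big)
print(same_log, same_big)  # True True
```

Because a strictly increasing transform preserves the ordering of the feature, every candidate split produces the same two groups and therefore the same loss, so the winning partition does not change.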

Worked example: volatility threshold

A tree might learn that a momentum signal works only when realised volatility is below 1.4%. The path is readable: if volatility is high, predict a weak next-period return; if volatility is low, inspect momentum. The trade-off is threshold instability: refitting on a different window can move the 1.4% cut-off, and observations near it then switch leaves.
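
One rough way to gauge that instability is to refit the split on bootstrap resamples and look at how far the threshold moves. The sketch below makes several assumptions: `best_threshold` is a hypothetical one-feature squared-error stump, the data are synthetic with a true break at 1.4, and the noise level (0.03) is chosen to be large enough that the break is hard to locate.

```python
import numpy as np


def best_threshold(x: np.ndarray, y: np.ndarray) -> float:
    """Squared-error stump on one feature; returns the best observed threshold."""
    best_loss, best_t = np.inf, None
    for t in np.unique(x)[:-1]:          # skip the max: it would empty the right child
        left = x <= t
        loss = ((y[left] - y[left].mean()) ** 2).sum()
        loss += ((y[~left] - y[~left].mean()) ** 2).sum()
        if loss < best_loss:
            best_loss, best_t = loss, t
    return float(best_t)


rng = np.random.default_rng(1)
n = 80
vol = rng.uniform(0.5, 2.5, size=n)
y = np.where(vol > 1.4, -0.02, 0.01) + 0.03 * rng.normal(size=n)  # assumed noise level

thresholds = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)     # bootstrap resample, drawn with replacement
    thresholds.append(best_threshold(vol[idx], y[idx]))

thresholds = np.array(thresholds)
# 5th, 50th and 95th percentiles of the refitted threshold.
print(np.percentile(thresholds, [5, 50, 95]).round(2))
```

A wide gap between the 5th and 95th percentiles suggests the rule's cut-off should be reported as a range, not as a single number.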

Common confusions and pitfalls

"Trees cannot overfit because they are simple rules." Deep trees can memorise noise with very specific paths.

"Feature importance is causal importance." It reflects split usefulness inside this fitted tree.

"No scaling means no preprocessing." Leakage-safe feature construction and validation still matter.

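The first pitfall above can be illustrated with a small experiment on targets that are pure noise. This is a sketch, not a production tree: `build`, `predict_one`, and `mse` are hypothetical helpers, the tree uses a single feature and a fixed depth limit, and the data and seed are arbitrary.

```python
import numpy as np


def build(x: np.ndarray, y: np.ndarray, depth: int):
    """Grow a one-feature regression tree; returns a float leaf or (t, left, right)."""
    values = np.unique(x)
    if depth == 0 or len(values) < 2:
        return float(y.mean())
    best_loss, best_t = np.inf, None
    for t in values[:-1]:                # skip the max: it would empty the right child
        left = x <= t
        loss = ((y[left] - y[left].mean()) ** 2).sum()
        loss += ((y[~left] - y[~left].mean()) ** 2).sum()
        if loss < best_loss:
            best_loss, best_t = loss, t
    left = x <= best_t
    return (best_t,
            build(x[left], y[left], depth - 1),
            build(x[~left], y[~left], depth - 1))


def predict_one(node, xi: float) -> float:
    """Follow the splits until a leaf value is reached."""
    while isinstance(node, tuple):
        t, lo, hi = node
        node = lo if xi <= t else hi
    return node


def mse(node, x: np.ndarray, y: np.ndarray) -> float:
    pred = np.array([predict_one(node, xi) for xi in x])
    return float(((y - pred) ** 2).mean())


rng = np.random.default_rng(7)
x_train, x_test = rng.uniform(0, 1, 64), rng.uniform(0, 1, 64)
y_train, y_test = rng.normal(size=64), rng.normal(size=64)  # pure noise: nothing to learn

results = []
for depth in (1, 3, 6, 12):
    tree = build(x_train, y_train, depth)
    results.append((depth, mse(tree, x_train, y_train), mse(tree, x_test, y_test)))

for depth, train_mse, test_mse in results:
    print(f"depth={depth:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error can only fall as depth grows, because each extra split refines the previous partition. On noise there is no reason for test error to follow it down, which is the gap that depth and leaf-size controls are meant to close.
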
Where this goes next

Random Forest: averages many trees to reduce variance.
K-Nearest Neighbors (KNN): offers a distance-based notion of locality to compare against.
Cross-Validation: chooses depth and leaf-size controls.
Logistic Regression: provides a smoother baseline that is linear in the log-odds.

References

Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 6 (Decision Trees, CART, Gini impurity, limitations).
Avrim Blum, John Hopcroft, and Ravindran Kannan (2020). Foundations of Data Science. Ch. 5.6.3 (Application: Learning Decision Trees) and Ch. 5.7 (Regularization).