Cross-Validation

Motivation: why this matters in quant finance

Cross-validation keeps model selection separate from final evaluation. In quant work, this is not bookkeeping. Hyperparameters, feature choices, thresholds, transformations, and data-cleaning rules can all overfit if selected by looking at the test set.

The basic idea is simple: repeatedly hold out part of the training data, fit on the rest, and measure performance on the holdout fold. The finance complication is time. Random folds are often wrong for forecasting because they can train on future regimes and validate on past ones.

The informal idea

Use the training set to create several miniature train/validation experiments. In standard K-fold, each observation gets exactly one turn as validation data. Average the fold scores for each candidate, choose the model or hyperparameter with the best average, and only then evaluate once on the untouched test set.

For chronological data, preserve time ordering with rolling, expanding, or blocked splits.
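The split types above can be compared on a toy index. The following is a minimal sketch, assuming scikit-learn's TimeSeriesSplit for the expanding and rolling cases and a shuffled KFold as the contrast. The 12-row sample and the helper trains_on_future are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n = 12  # hypothetical number of chronologically ordered observations
X = np.arange(n).reshape(-1, 1)

# Expanding window: the training slice grows and validation always comes later.
expanding = TimeSeriesSplit(n_splits=3)
# Rolling window: cap the training slice at the 3 most recent observations.
rolling = TimeSeriesSplit(n_splits=3, max_train_size=3)
# Shuffled K-fold: validation points are scattered through the sample.
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)


def trains_on_future(splitter):
    """One flag per fold: True if any training index comes after
    the earliest validation index."""
    return [bool(tr.max() > va.min()) for tr, va in splitter.split(X)]


for name, splitter in [("expanding", expanding),
                       ("rolling", rolling),
                       ("shuffled", shuffled)]:
    for tr, va in splitter.split(X):
        print(name, "train", tr.tolist(), "val", va.tolist())
    print(name, "trains on future:", trains_on_future(splitter))
```

In both time-series variants every validation index follows every training index, while the shuffled folds mix later observations into the training slice. A blocked scheme with a gap between train and validation can be approximated with the gap argument of TimeSeriesSplit.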

Formal statement

In $K$-fold cross-validation, split the training indices into $K$ disjoint folds $I_1,\ldots,I_K$. For each fold $k$, fit on all indices except $I_k$ to obtain $\hat{f}^{(-k)}$, then evaluate its average loss $L_{I_k}$ on $I_k$:

$$\text{CV}_K=\frac{1}{K}\sum_{k=1}^K L_{I_k}(\hat{f}^{(-k)}).$$

The selected hyperparameter is the one with the best average validation score. The test set remains unused until the selection rule is fixed.
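The formula can be computed by hand to see what a library call does internally. This is a sketch under stated assumptions: mean squared error stands in for the loss $L_{I_k}$, the folds are contiguous and unshuffled, and the synthetic data and the helper cv_score are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X[:, 0] - 0.5 * X[:, 1] + 0.3 * rng.normal(size=120)

K = 4
folds = KFold(n_splits=K)  # contiguous folds I_1..I_K, no shuffling


def cv_score(alpha):
    """CV_K for ridge with the given alpha, using MSE as the fold loss."""
    fold_losses = []
    for train_idx, val_idx in folds.split(X):
        # Fit on everything except fold k: this is f_hat^(-k).
        f_hat = Ridge(alpha=alpha).fit(X[train_idx], y[train_idx])
        resid = y[val_idx] - f_hat.predict(X[val_idx])
        fold_losses.append(np.mean(resid ** 2))  # L_{I_k}
    return float(np.mean(fold_losses))  # average over the K folds


manual = {alpha: cv_score(alpha) for alpha in [0.1, 1.0, 10.0]}
best_alpha = min(manual, key=manual.get)  # lowest average validation loss

# Cross-check against scikit-learn (scores are negated MSE).
library = {
    alpha: float(-cross_val_score(Ridge(alpha=alpha), X, y, cv=folds,
                                  scoring="neg_mean_squared_error").mean())
    for alpha in manual
}
print(manual, best_alpha)
```

The manual loop and cross_val_score should agree up to floating-point error, since both fit one model per fold and average the fold losses.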

Implementation

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 240 ordered observations, 4 features, 2 of them informative.
rng = np.random.default_rng(43)
X = rng.normal(size=(240, 4))
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + 0.25 * rng.normal(size=240)

# Chronological folds: each validation slice comes after its training slice.
cv = TimeSeriesSplit(n_splits=5)
for alpha in [0.1, 1.0, 10.0]:
    # The scaler is refit on the training slice of every fold.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    # Scores are negated MSE, so flip the sign before reporting.
    print(alpha, round(-scores.mean(), 4))
# 0.1 0.0711
# 1.0 0.0709
# 10.0 0.0701

The scaler sits inside the pipeline, so each fold learns scaling parameters only from its training slice. Scaling before splitting is leakage.
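To make the leakage concrete, the sketch below compares a scaler fit on the whole sample with scalers fit on each training slice only. The drifting feature is a hypothetical construction chosen so that later data visibly shifts the mean; it is not taken from the example above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical feature with an upward drift, so later periods have a higher mean.
x = (np.linspace(0.0, 3.0, 200) + rng.normal(scale=0.5, size=200)).reshape(-1, 1)

# Leaky: fit once on every period, including those used later for validation.
global_scaler = StandardScaler().fit(x)

gaps = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(x):
    # What a pipeline does inside a fold: fit on the training slice only.
    fold_scaler = StandardScaler().fit(x[train_idx])
    gap = float(global_scaler.mean_[0] - fold_scaler.mean_[0])
    gaps.append(gap)
    print(f"train size {len(train_idx):3d}  "
          f"fold mean {fold_scaler.mean_[0]:.3f}  "
          f"global minus fold mean {gap:.3f}")
```

The global mean sits above every fold-level mean because it already contains the later, higher observations. A model trained on globally scaled features has therefore seen information from its own validation period.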

Key properties and trade-offs

Property	Meaning	Finance consequence
Repeated validation	Each fold gives a noisy out-of-sample estimate.	Average scores are more stable than one lucky split.
Pipeline discipline	Preprocessing is fit inside each fold.	Prevents feature leakage.
Time order matters	Random folds can train on the future.	Use time-series splits for forecasting and strategy research.
Test set is final	It is touched once, only after model selection is complete.	Reusing it turns it into validation data.

Worked example: choosing ridge alpha

A researcher tries $\alpha\in\{0.1,1,10\}$ for Ridge Regression. Cross-validation picks $\alpha=10$. The correct next step is to refit ridge with $\alpha=10$ on the full training set, then evaluate once on the test period.
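The select, refit, and test-once workflow can be sketched with GridSearchCV, which refits the winning configuration on the full training set when refit=True. The data, the 240/60 chronological split, and the variable names are assumptions for illustration. On this synthetic sample the selected alpha need not be 10.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + 0.25 * rng.normal(size=300)

# Chronological split: first 240 rows for training, last 60 as the test period.
X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=5),   # selection uses training data only
    scoring="neg_mean_squared_error",
    refit=True,                       # refit the winner on all of X_train
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
# The test period is used exactly once, after the selection rule is fixed.
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(best_alpha, round(test_mse, 4))
```

Because the scaler is part of the pipeline, the final refit also learns its scaling parameters from the training set alone before the test period is scored.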

Common confusions and pitfalls

"Cross-validation means random shuffling." Random shuffling is one version. Time-series data often needs chronological splits.

"The test set can choose hyperparameters." Once it influences choices, it is no longer a test set.

"Preprocessing before CV is harmless." Even a scaler can leak future distribution information.

Where this goes next

Ridge Regression: uses CV to select penalty strength.
Support Vector Machine (SVM): needs CV for $C$, $\gamma$, and kernel choices.
Random Forest: uses validation to tune tree depth, leaf size, and feature subsampling.
Regularisation: L1 vs L2: explains what the selected penalty is doing geometrically.

References

Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 2 (train/test discipline, GridSearchCV) and Ch. 3 (cross-validation for classification).
Andrew Ng and Tengyu Ma (2023). CS229 Lecture Notes. Ch. 9.3 (Model Selection via Cross Validation).
Deisenroth, Faisal, and Ong (2020). Mathematics for Machine Learning. Ch. 8 (Model Selection).