
Cross-Validation

Motivation: why this matters in quant finance

Cross-validation keeps model selection separate from final evaluation. In quant work, this separation is more than bookkeeping: hyperparameters, feature choices, thresholds, transformations, and data-cleaning rules can all overfit if they are selected by looking at the test set.

The basic idea is simple: repeatedly hold out part of the training data, fit on the rest, and measure performance on the holdout fold. The finance complication is time. Random folds are often wrong for forecasting because they can train on future regimes and validate on past ones.

The informal idea

Use the training set to create several miniature train/validation experiments. In standard K-fold cross-validation, each observation gets exactly one turn as validation data. Average the scores to choose the model or hyperparameter. Only then evaluate once on the untouched test set.

For chronological data, preserve time ordering with rolling, expanding, or blocked splits.
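Here is a minimal sketch of two chronological schemes. It assumes scikit-learn's `TimeSeriesSplit` applied to 12 sorted toy observations. The default settings give expanding windows. `max_train_size` caps the training slice, which gives a rolling window. `gap` leaves a buffer between training and validation, and it requires a reasonably recent scikit-learn. The sizes and the one-observation gap are illustrative choices, and blocked splits are not shown.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_obs = 12
t = np.arange(n_obs)  # stand-in for chronologically sorted observation indices

# Expanding window: the training slice grows with each split.
expanding = TimeSeriesSplit(n_splits=3)
# Rolling window: training slice capped at 4 observations,
# with one observation dropped between train and validation as a buffer.
rolling = TimeSeriesSplit(n_splits=3, max_train_size=4, gap=1)

splits = {}
for name, splitter in [("expanding", expanding), ("rolling", rolling)]:
    splits[name] = [(tr.tolist(), va.tolist()) for tr, va in splitter.split(t)]
    for tr, va in splits[name]:
        print(f"{name:9s} train={tr} validate={va}")
```

In both schemes, every training index precedes every validation index. The rolling splits train on [0, 1], then [1, 2, 3, 4], then [4, 5, 6, 7]. In each case the index just before the validation block is skipped because of the gap.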

Formal statement

In $K$-fold cross-validation, split the training indices into $K$ disjoint folds $I_1,\ldots,I_K$. For each fold $k$, fit on all indices except $I_k$ and evaluate on $I_k$:

$$\text{CV}_K=\frac{1}{K}\sum_{k=1}^K L_{I_k}\big(\hat{f}^{(-k)}\big),$$

where $\hat{f}^{(-k)}$ is the model fit without fold $I_k$ and $L_{I_k}$ is its average loss on that fold.

The selected hyperparameter is the one with the best average validation score. The test set remains unused until the selection rule is fixed.
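The formula can be computed term by term. This is a minimal sketch under two assumptions. First, the data are synthetic and i.i.d., so ordinary contiguous K-fold is acceptable here. Second, the model is closed-form ridge without an intercept, used as a stand-in for a library estimator so that each fit and each fold loss is explicit. The helper names `ridge_fit` and `cv_k` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge without intercept: w = (X'X + alpha*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def cv_k(X, y, alpha, folds):
    """CV_K = (1/K) * sum_k L_{I_k}(f_hat^{(-k)}), with L = mean squared error."""
    fold_losses = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False  # fit on every index except I_k
        w = ridge_fit(X[train_mask], y[train_mask], alpha)
        resid = y[val_idx] - X[val_idx] @ w  # evaluate on I_k only
        fold_losses.append(np.mean(resid ** 2))
    return float(np.mean(fold_losses))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -0.5, 0.0]) + 0.3 * rng.normal(size=100)

# K = 5 disjoint, contiguous folds I_1..I_5 that together cover every index once.
folds = np.array_split(np.arange(len(y)), 5)

cv_scores = {alpha: cv_k(X, y, alpha, folds) for alpha in [0.1, 1.0, 10.0]}
best_alpha = min(cv_scores, key=cv_scores.get)  # lowest average validation loss
print(cv_scores, best_alpha)
```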

Implementation

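```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative features, two pure-noise features.
rng = np.random.default_rng(43)
X = rng.normal(size=(240, 4))
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + 0.25 * rng.normal(size=240)

# Expanding-window splits: each validation block comes after its training slice.
cv = TimeSeriesSplit(n_splits=5)
for alpha in [0.1, 1.0, 10.0]:
    # Scaler and ridge model are refit together inside every fold.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    # scikit-learn scorers are higher-is-better, so MSE comes back negated.
    print(alpha, round(-scores.mean(), 4))

# Output (alpha, mean validation MSE):
# 0.1 0.0711
# 1.0 0.0709
# 10.0 0.0701
```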

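The scaler sits inside the pipeline, so in each fold it learns its mean and standard deviation only from that fold's training slice. Fitting the scaler on the full dataset before splitting is leakage, because validation-period statistics would then shape the training features.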
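You can check directly that each fold's scaler sees only its training slice. This sketch makes two assumptions. It adds a linear drift to the features to mimic a shifting regime. It also relies on `"standardscaler"` being the step name that `make_pipeline` generates automatically. For comparison, the last line computes the mean that a scaler fit before splitting would use.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(43)
# Drifting level (assumption): later periods have a higher mean than earlier ones.
X = rng.normal(size=(240, 4)) + np.linspace(0.0, 2.0, 240)[:, None]
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + 0.25 * rng.normal(size=240)

template = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
fold_means = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = clone(template).fit(X[train_idx], y[train_idx])
    scaler = model.named_steps["standardscaler"]
    # Pair the fitted scaler mean with the mean of the training slice alone.
    fold_means.append((scaler.mean_.copy(), X[train_idx].mean(axis=0)))

# What "scale first, then split" would use: statistics from all 240 rows.
leaky_mean = StandardScaler().fit(X).mean_
print("fold 1 scaler mean:", np.round(fold_means[0][0], 2))
print("full-sample mean:  ", np.round(leaky_mean, 2))
```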

Key properties and trade-offs

| Property | Meaning | Finance consequence |
| --- | --- | --- |
| Repeated validation | Each fold gives a noisy out-of-sample estimate. | Average scores are more stable than one lucky split. |
| Pipeline discipline | Preprocessing is fit inside each fold. | Prevents feature leakage. |
| Time order matters | Random folds can train on the future. | Use time-series splits for forecasting and strategy research. |
| Test set is final | It is touched only once, after model selection is complete. | Reusing it turns it into validation data. |
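The first row of the table is easier to see when you look at the individual fold scores rather than only their mean. This sketch reuses the synthetic data from the implementation above and fixes `alpha=1.0`, which is an arbitrary choice. Because `TimeSeriesSplit` grows the training slice, the early folds may be noisier than the later ones.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(43)
X = rng.normal(size=(240, 4))
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + 0.25 * rng.normal(size=240)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# One out-of-sample MSE per fold (sign flipped back from the negated scorer).
fold_mse = -cross_val_score(
    model, X, y, cv=TimeSeriesSplit(n_splits=5), scoring="neg_mean_squared_error"
)
print("per-fold MSE:", np.round(fold_mse, 4))
print("mean:", round(fold_mse.mean(), 4), "std:", round(fold_mse.std(ddof=1), 4))
```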

Worked example: choosing ridge alpha

A researcher tries $\alpha\in\{0.1,1,10\}$ for ridge regression. Cross-validation picks $\alpha=10$. The correct next step is to refit ridge with $\alpha=10$ on the full training set, then evaluate once on the test period.
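One way to carry out this sequence in scikit-learn is `GridSearchCV` with `refit=True`. After the search, it refits the best pipeline on the full training set. This sketch assumes 300 synthetic rows in chronological order, with the last 60 held out as the test period. On these data the selected alpha need not be 10.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(43)
X = rng.normal(size=(300, 4))
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + 0.25 * rng.normal(size=300)

# Chronological split: the last 60 rows are the untouched test period.
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
    refit=True,  # refit the best pipeline on all of X_train after the search
)
search.fit(X_train, y_train)  # model selection sees the training period only

best_alpha = search.best_params_["ridge__alpha"]
# The single evaluation on the test period, after the selection rule is fixed.
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(best_alpha, round(test_mse, 4))
```

The test rows enter only in the last prediction. If `test_mse` were then used to revisit the alpha grid, the test period would become validation data.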

Common confusions and pitfalls

"Cross-validation means random shuffling." Random shuffling is one version. Time-series data often needs chronological splits.
"The test set can choose hyperparameters." Once it influences choices, it is no longer a test set.
"Preprocessing before CV is harmless." Even a scaler can leak future distribution information.

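The first pitfall can be checked mechanically. This sketch counts the folds whose training set contains rows dated after the earliest validation row. It assumes rows are sorted so that row `i` is observed at time `i`. The helper `folds_training_on_future` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

t = np.arange(100)  # row i is observed at time i (rows sorted chronologically)

def folds_training_on_future(splitter):
    """Count folds whose training set includes a row later than some validation row."""
    count = 0
    for train_idx, val_idx in splitter.split(t):
        if t[train_idx].max() > t[val_idx].min():
            count += 1
    return count

shuffled = folds_training_on_future(KFold(n_splits=5, shuffle=True, random_state=0))
plain = folds_training_on_future(KFold(n_splits=5))  # contiguous folds, no shuffle
chronological = folds_training_on_future(TimeSeriesSplit(n_splits=5))
print(shuffled, plain, chronological)
```

Shuffled K-fold trains on later rows in every fold. Unshuffled K-fold does the same in every fold except the last. `TimeSeriesSplit` never does.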