K-Nearest Neighbors (KNN)

Motivation: why this matters in quant finance

K-nearest neighbors is the simplest useful reminder that machine learning can be local. Instead of fitting coefficients, splits, or margins, kNN stores the training examples and predicts from the most similar historical cases.

That makes it a natural baseline for analogue reasoning: find past market days with similar volatility, trend, and liquidity; inspect what happened next; average the outcomes. It is rarely the final model for large finance datasets, but it is an excellent diagnostic for feature geometry.

The informal idea

For a new observation, compute its distance to every training observation. Pick the $k$ closest. For classification, take a majority vote among their labels. For regression, average their target values.

Training is almost empty: store the data. Prediction does the work. That is the opposite of Linear Regression, which compresses the training data into coefficients.

Formal statement

Given training observations $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$ and a distance $d$, let $N_k(\mathbf{x}_*)$ be the indices of the $k$ training observations nearest to the query point $\mathbf{x}_*$ under $d$.

For regression,

$$\hat{y}(\mathbf{x}_*)=\frac{1}{k}\sum_{i\in N_k(\mathbf{x}_*)}y_i.$$

For classification,

$$\hat{c}(\mathbf{x}_*)=\arg\max_c\sum_{i\in N_k(\mathbf{x}_*)}\mathbf{1}_{\{y_i=c\}}.$$
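
As a minimal sketch of the regression rule above, the helper below averages the targets of the $k$ nearest training rows under Euclidean distance. The function name `knn_regress` and the toy numbers are illustrative assumptions, not part of the original implementation.

```python
import numpy as np


def knn_regress(X_train, y_train, X_query, k=5):
    """Average the targets of the k nearest training rows (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.atleast_2d(np.asarray(X_query, dtype=float))
    # Pairwise distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    # Indices of the k closest training rows for each query row.
    neighbors = np.argsort(distances, axis=1)[:, :k]
    return y_train[neighbors].mean(axis=1)


# Made-up one-feature data where the target equals the feature.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

pred = knn_regress(X_train, y_train, [[1.2]], k=3)
print(pred)
```

With $k=3$ the query 1.2 is matched to the training points at 1.0, 2.0, and 0.0, so the prediction is their mean target, 1.0. The outlier at 10.0 never enters the average, which is the sense in which the estimate is local.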

Implementation

    import numpy as np


    class KNNClassifier:
        """Tiny kNN classifier using Euclidean distance."""

        def __init__(self, k: int = 5):
            self.k = k

        def fit(self, X: np.ndarray, y: np.ndarray):
            # "Training" only stores the data.
            self.X_train = X
            self.y_train = y
            return self

        def predict(self, X: np.ndarray) -> np.ndarray:
            out = []
            for x in X:
                # Distance from the query to every stored point.
                distances = np.linalg.norm(self.X_train - x, axis=1)
                neighbors = np.argsort(distances)[:self.k]
                # Majority vote; np.unique sorts labels, so ties go to the smallest label.
                labels, counts = np.unique(self.y_train[neighbors], return_counts=True)
                out.append(labels[np.argmax(counts)])
            return np.array(out)


    # Synthetic regimes: label 0 = calm, label 1 = stress.
    rng = np.random.default_rng(19)
    calm = rng.normal([0.0, 0.8], [0.5, 0.2], size=(40, 2))
    stress = rng.normal([-1.0, 1.8], [0.5, 0.3], size=(40, 2))
    X = np.vstack([calm, stress])
    y = np.array([0] * 40 + [1] * 40)

    print(KNNClassifier(k=5).fit(X, y).predict(np.array([[-0.8, 1.7], [0.2, 0.7]])))  # [1 0]

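The two features can be read as return and realised volatility. The first query point, (-0.8, 1.7), sits near the centre of the stress cluster and the second, (0.2, 0.7), near the centre of the calm cluster, so the model labels them 1 and 0 purely by local analogy.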

Key properties and trade-offs

| Property | Meaning | Finance consequence |
|---|---|---|
| Instance-based | The data is the model. | Stale history directly affects predictions. |
| Metric-dependent | Distance defines similarity. | Scaling and feature design are central. |
| Nonparametric | Boundaries can be nonlinear without coefficients. | Flexible, but hard to explain compactly. |
| Slow prediction | Naive prediction compares with many stored points. | Large tick datasets need indexing or approximation. |
| Curse of dimensionality | Distances degrade in high dimension. | Use careful feature selection or dimension reduction. |
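
The metric-dependence row can be checked numerically. The sketch below uses four invented training days with features on very different scales (return in percent, volume in shares) and compares the nearest neighbour before and after standardising each column. The data and the helper `nearest_index` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical features: [daily return in %, traded volume in shares].
X_train = np.array([
    [-1.0, 1_300_000.0],
    [ 2.0, 1_100_500.0],
    [ 0.5,   900_000.0],
    [-2.0, 1_500_000.0],
])
query = np.array([-0.9, 1_100_000.0])


def nearest_index(X, x):
    """Index of the row of X closest to x in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - x, axis=1)))


# Raw units: the volume column dominates the distance.
raw_nearest = nearest_index(X_train, query)

# Standardise with training-set statistics only, then search again.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
scaled_nearest = nearest_index((X_train - mu) / sigma, (query - mu) / sigma)

print(raw_nearest, scaled_nearest)  # 1 0
```

In raw units the nearest neighbour is row 1, whose return (+2.0%) is far from the query's but whose volume differs by only 500 shares. After standardisation the nearest neighbour is row 0, whose return is almost identical to the query's. Neither answer is correct in the abstract; the point is that the scaling choice decides which days count as similar.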

Worked example: regime analogy

If today's features are a return of $-0.8\%$ and a realised volatility of $1.7\%$, and the five nearest historical neighbours are all stress days, kNN classifies today as stress. This is not a beta, split, or margin; it is local historical analogy.
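
The worked example can be reproduced on a small hand-written history. The ten historical days and their labels below are invented for illustration; only today's feature values and $k=5$ come from the example above.

```python
import numpy as np

# Hypothetical history: [daily return %, realised volatility %].
history = np.array([
    [ 0.1, 0.8], [ 0.3, 0.7], [-0.2, 0.9], [ 0.0, 1.0],
    [-0.9, 1.8], [-1.1, 1.6], [-0.7, 1.9], [-0.6, 1.5],
    [-1.0, 2.0], [ 0.4, 0.6],
])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])  # 0 = calm, 1 = stress
today = np.array([-0.8, 1.7])
k = 5

# Rank every historical day by distance to today and keep the k closest.
distances = np.linalg.norm(history - today, axis=1)
neighbors = np.argsort(distances)[:k]

# Count votes per label and predict the majority.
votes = np.bincount(labels[neighbors], minlength=2)
prediction = int(np.argmax(votes))

for i in neighbors:
    print(f"day {i}: features={history[i]}, label={labels[i]}, distance={distances[i]:.3f}")
print("votes [calm, stress]:", votes, "-> prediction:", prediction)
```

Printing the neighbours rather than only the label is the practical advantage of kNN here: the prediction comes with the specific historical days that justify it, which can be inspected one by one.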

Common confusions and pitfalls

"kNN has no hyperparameters." The value of $k$, the metric, and scaling dominate behaviour.
"No training means no overfitting." With $k=1$, the model can memorise noise.
"More features always help." Irrelevant features dilute distance.

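The three pitfalls can be probed with leave-one-out evaluation. This is a rough sketch on synthetic, overlapping classes; the helpers `knn_predict` and `loo_accuracy`, the cluster parameters, and the 20 added noise columns are all assumptions chosen for illustration.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote kNN (Euclidean); labels must be non-negative integers."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(row).argmax() for row in y_train[nn]])


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    idx = np.arange(len(X))
    hits = 0
    for i in idx:
        mask = idx != i
        hits += int(knn_predict(X[mask], y[mask], X[i:i + 1], k)[0] == y[i])
    return hits / len(X)


rng = np.random.default_rng(0)
calm = rng.normal([0.0, 0.8], [0.5, 0.3], size=(60, 2))
stress = rng.normal([-0.5, 1.3], [0.5, 0.3], size=(60, 2))
X = np.vstack([calm, stress])
y = np.array([0] * 60 + [1] * 60)

# In-sample, k=1 is perfect by construction: each point is its own nearest neighbour.
train_acc_k1 = float(np.mean(knn_predict(X, y, X, 1) == y))
print("in-sample accuracy, k=1:", train_acc_k1)

# Append 20 pure-noise columns that carry no information about the label.
X_noisy = np.hstack([X, rng.normal(size=(len(X), 20))])

results = {}
for k in (1, 5, 15):
    results[k] = (loo_accuracy(X, y, k), loo_accuracy(X_noisy, y, k))
    print(f"k={k}: LOO accuracy clean={results[k][0]:.3f}, with noise features={results[k][1]:.3f}")
```

The in-sample score for $k=1$ is always 1.0 on data without duplicate points, which is why it says nothing about generalisation. On data like this, the leave-one-out score for $k=1$ is usually lower than that, and the noise columns usually reduce accuracy further, though the exact numbers depend on the seed and the cluster overlap.
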