K-Nearest Neighbors (KNN)

Motivation: why this matters in quant finance

K-nearest neighbors is the simplest useful reminder that machine learning can be local. Instead of fitting coefficients, splits, or margins, kNN stores the training examples and predicts from the most similar historical cases.

That makes it a natural baseline for analogue reasoning: find past market days with similar volatility, trend, and liquidity; inspect what happened next; average the outcomes. It is rarely the final model for large finance datasets, but it is an excellent diagnostic for feature geometry.

The informal idea

For a new observation, compute its distance to every training observation. Pick the $k$ closest. For classification, vote among labels. For regression, average target values.

Training is almost empty: store the data. Prediction does the work. That is the opposite of Linear Regression, which compresses the training data into coefficients.

Formal statement

Given training observations $\{(\mathbf{x}_i,y_i)\}_{i=1}^n$ and a distance $d$, let $N_k(\mathbf{x}_*)$ be the set of indices of the $k$ training points nearest to $\mathbf{x}_*$ under $d$.

For regression,

$$\hat{y}(\mathbf{x}_*)=\frac{1}{k}\sum_{i\in N_k(\mathbf{x}_*)}y_i.$$

For classification,

$$\hat{c}(\mathbf{x}_*)=\arg\max_c\sum_{i\in N_k(\mathbf{x}_*)}\mathbf{1}_{\{y_i=c\}}.$$
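
The implementation later in this note covers the classification rule only. As a complement, here is a minimal sketch of the regression estimator, assuming Euclidean distance for $d$. The function name `knn_regress` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Average the targets of the k nearest training points (Euclidean)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    neighbors = np.argsort(distances)[:k]   # indices forming N_k(x_*)
    return y_train[neighbors].mean()        # (1/k) * sum of neighbour targets

# Hypothetical toy data: one feature, one target per observation.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

prediction = knn_regress(X_train, y_train, x_new=[1.2], k=3)
print(prediction)  # 2.0
```

The three nearest points to 1.2 are 1.0, 2.0 and 0.0, with targets 2, 3 and 1, so the prediction is their mean. The distant point with target 100 has no influence, which is the local behaviour the formula describes.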

Implementation

import numpy as np

class KNNClassifier:
    """Tiny kNN classifier using Euclidean distance."""
    def __init__(self, k: int = 5):
        self.k = k

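    def fit(self, X: np.ndarray, y: np.ndarray):
        # "Training" only stores the data; coerce so plain lists also work.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        out = []
        for x in np.asarray(X, dtype=float):
            # Euclidean distance from the query to every stored point.
            distances = np.linalg.norm(self.X_train - x, axis=1)
            # Indices of the k closest training points.
            neighbors = np.argsort(distances)[:self.k]
            # Majority vote; on a tie np.argmax returns the smallest label.
            labels, counts = np.unique(self.y_train[neighbors], return_counts=True)
            out.append(labels[np.argmax(counts)])
        return np.array(out)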

rng = np.random.default_rng(19)
calm = rng.normal([0.0, 0.8], [0.5, 0.2], size=(40, 2))
stress = rng.normal([-1.0, 1.8], [0.5, 0.3], size=(40, 2))
X = np.vstack([calm, stress])
y = np.array([0] * 40 + [1] * 40)
print(KNNClassifier(k=5).fit(X, y).predict(np.array([[-0.8, 1.7], [0.2, 0.7]])))
# [1 0]

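The two features can be read as daily return and realised volatility, with label 0 for calm days and label 1 for stress days. The first query lies near the stress cluster and the second near the calm cluster, so the model predicts by local analogy.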
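
Euclidean distance treats one unit of every feature as equally important, so the toy example works partly because its two features sit on a similar scale. The sketch below uses hypothetical return and volume figures to show how an unscaled feature can take over the neighbour search. Z-scoring is assumed here as the rescaling step; it is one common choice, not the only one.

```python
import numpy as np

# Hypothetical features: daily return in percent, traded volume in shares.
X = np.array([
    [-0.9, 1_000_000.0],   # return close to the query, volume slightly off
    [ 2.5, 1_000_500.0],   # return far from the query, volume closer
    [ 0.1, 1_200_000.0],
    [-1.5,   900_000.0],
])
query = np.array([-1.0, 1_000_600.0])

def nearest(X, q):
    """Index of the single nearest row to q under Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

# Raw units: the volume column dominates every distance.
raw_nn = nearest(X, query)

# Z-score each column using training statistics, then search again.
mean, std = X.mean(axis=0), X.std(axis=0)
scaled_nn = nearest((X - mean) / std, (query - mean) / std)

print(raw_nn, scaled_nn)  # 1 0
```

In raw units the nearest neighbour is the row with a return of 2.5, because a 100-share volume gap outweighs a 3.5-point return gap. After scaling, the row with a return of -0.9 becomes the nearest.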

Key properties and trade-offs

Property	Meaning	Finance consequence
Instance-based	The data is the model.	Stale history directly affects predictions.
Metric-dependent	Distance defines similarity.	Scaling and feature design are central.
Nonparametric	Boundaries can be nonlinear without coefficients.	Flexible, but hard to explain compactly.
Slow prediction	Naive prediction compares with many stored points.	Large tick datasets need indexing or approximation.
Curse of dimensionality	Distances degrade in high dimension.	Use careful feature selection or dimension reduction.
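
The last row of the table can be illustrated numerically. The sketch below is a toy experiment on synthetic Gaussian data, not a result about any market dataset. It keeps two informative features fixed, appends a growing number of pure-noise features, and reports the ratio of the nearest to the farthest distance from a query at the origin. The helper name `contrast` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=(n, 2))      # two informative features (synthetic)
query_signal = np.zeros(2)            # query placed at the origin

def contrast(n_noise):
    """Nearest-to-farthest distance ratio after appending noise features."""
    noise = rng.normal(size=(n, n_noise))
    X = np.hstack([signal, noise])
    q = np.concatenate([query_signal, np.zeros(n_noise)])
    d = np.linalg.norm(X - q, axis=1)
    return d.min() / d.max()

for n_noise in (0, 10, 100, 1000):
    print(n_noise, round(contrast(n_noise), 3))
```

As noise features are added, the ratio moves towards 1: the nearest and the farthest training points become almost equally distant, so "nearest" carries little information. This is one reason feature selection or dimension reduction tends to matter for kNN.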

Worked example: regime analogy

If today's features are return $-0.8\%$ and realised volatility $1.7\%$, and the five nearest historical neighbours are stress days, kNN classifies today as stress. This is not a beta, split, or margin; it is local historical analogy.

Common confusions and pitfalls

"kNN has no hyperparameters." The value of $k$, the metric, and scaling dominate behaviour.

"No training means no overfitting." With $k=1$, the model can memorise noise.
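
A small experiment makes the memorisation point concrete. The sketch below assumes synthetic data in which 25% of labels are flipped at random, and compares accuracy on the training set itself with a leave-one-out estimate. The helper names `knn_predict` and `leave_one_out_accuracy` are hypothetical, and the vote is written for binary 0/1 labels with odd $k$ only.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Majority-vote kNN with Euclidean distance (binary 0/1 labels, odd k)."""
    preds = []
    for x in X_query:
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        preds.append(int(y_train[nearest].mean() > 0.5))
    return np.array(preds)

def leave_one_out_accuracy(X, y, k):
    """Predict each point from all the other points, then score."""
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        hits += knn_predict(X[keep], y[keep], X[i:i + 1], k)[0] == y[i]
    return hits / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
clean = (X[:, 0] > 0).astype(int)     # true rule: sign of the first feature
flip = rng.random(120) < 0.25         # 25% of labels are flipped at random
y = np.where(flip, 1 - clean, clean)

train_acc_k1 = (knn_predict(X, y, X, k=1) == y).mean()
loo_acc_k1 = leave_one_out_accuracy(X, y, k=1)
loo_acc_k15 = leave_one_out_accuracy(X, y, k=15)
print(train_acc_k1, loo_acc_k1, loo_acc_k15)
```

With $k=1$ every training point is its own nearest neighbour, so training accuracy is exactly 1.0 even though a quarter of the labels are noise. The leave-one-out figure removes that self-match and is lower. A larger $k$ typically, though not always, recovers some of the gap, and the same loop can be used to compare candidate values of $k$.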

"More features always help." Irrelevant features dilute distance.

Where this goes next

Decision Tree: learns local regions through splits instead of distances.
Support Vector Machine (SVM): builds a boundary from support vectors.
Cross-Validation: selects $k$ and preprocessing choices.
Random Forest: handles nonlinear tabular structure through tree averaging.

References

Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 1 (instance-based learning) and Ch. 3 (classification workflows).
Broad ML textbook extract, resources/ml/Copy of 11. Machine Learning.txt. Ch. 3 (K Nearest Neighbours), consulted as secondary OCR-backed support.