K-Nearest Neighbors (KNN)
Motivation: why this matters in quant finance
K-nearest neighbors is the simplest useful reminder that machine learning can be local. Instead of fitting coefficients, splits, or margins, kNN stores the training examples and predicts from the most similar historical cases.
That makes it a natural baseline for analogue reasoning: find past market days with similar volatility, trend, and liquidity; inspect what happened next; average the outcomes. It is rarely the final model for large finance datasets, but it is an excellent diagnostic for feature geometry.
The informal idea
For a new observation, compute its distance to every training observation. Pick the $k$ closest. For classification, take a majority vote among their labels. For regression, average their target values.
Formal statement
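Given training observations $(x_i, y_i)_{i=1}^{n}$ and a distance $d(\cdot, \cdot)$, let $N_k(x)$ be the indices of the $k$ training points nearest to $x$ under $d$.
For regression,
$$\hat{y}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i.$$
For classification,
$$\hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}\{y_i = c\}.$$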
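The regression rule is not covered by the classifier in the Implementation section, so here is a minimal sketch of it, assuming Euclidean distance. The function name `knn_regress` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def knn_regress(X_train, y_train, x, k=3):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(distances)[:k]  # indices of the k nearest points
    return y_train[neighbors].mean()


# Hypothetical toy data: one feature, one numeric outcome per observation.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

prediction = knn_regress(X_train, y_train, np.array([1.1]), k=3)
print(prediction)
# 2.0
```

The three nearest points to 1.1 are 1.0, 2.0, and 0.0, so the prediction is the mean of their targets, (2 + 3 + 1) / 3. The distant outlier at 10.0 has no influence because it is not among the neighbors.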
Implementation
```python
import numpy as np


class KNNClassifier:
    """Tiny kNN classifier using Euclidean distance."""

    def __init__(self, k: int = 5):
        self.k = k

    def fit(self, X: np.ndarray, y: np.ndarray):
        # kNN has no training step: it only stores the data.
        self.X_train = X
        self.y_train = y
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        out = []
        for x in X:
            # Distance from the query point to every stored observation.
            distances = np.linalg.norm(self.X_train - x, axis=1)
            neighbors = np.argsort(distances)[:self.k]
            # Majority vote; np.unique sorts labels, so ties go to the smallest label.
            labels, counts = np.unique(self.y_train[neighbors], return_counts=True)
            out.append(labels[np.argmax(counts)])
        return np.array(out)


# Two synthetic regimes; columns are (return, realised volatility).
rng = np.random.default_rng(19)
calm = rng.normal([0.0, 0.8], [0.5, 0.2], size=(40, 2))
stress = rng.normal([-1.0, 1.8], [0.5, 0.3], size=(40, 2))
X = np.vstack([calm, stress])
y = np.array([0] * 40 + [1] * 40)

print(KNNClassifier(k=5).fit(X, y).predict(np.array([[-0.8, 1.7], [0.2, 0.7]])))
# [1 0]
```

The two features can be read as return and realised volatility. The model predicts by local analogy.
Key properties and trade-offs
| Property | Meaning | Finance consequence |
|---|---|---|
| Instance-based | The data is the model. | Stale history directly affects predictions. |
| Metric-dependent | Distance defines similarity. | Scaling and feature design are central. |
| Nonparametric | Boundaries can be nonlinear without coefficients. | Flexible, but hard to explain compactly. |
| Slow prediction | Naive prediction compares with many stored points. | Large tick datasets need indexing or approximation. |
| Curse of dimensionality | Distances degrade in high dimension. | Use careful feature selection or dimension reduction. |
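The "Metric-dependent" row is easy to demonstrate. The sketch below assumes two hypothetical features on very different scales (a daily return and a traded volume) and shows that the nearest neighbor changes once both are standardised. The data and the helper `nearest_index` are illustrative assumptions, not part of the implementation above.

```python
import numpy as np

# Hypothetical features on very different scales: daily return and traded volume.
X_train = np.array([
    [0.010, 1_000_000.0],
    [-0.030, 1_010_000.0],
    [0.012, 1_500_000.0],
])
query = np.array([0.011, 1_012_000.0])


def nearest_index(X, x):
    """Index of the single nearest row of X to x under Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - x, axis=1)))


# On raw features, volume dominates the distance and the return is ignored.
raw_nn = nearest_index(X_train, query)

# Standardise with training-set statistics, then apply the same transform to the query.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
scaled_nn = nearest_index((X_train - mu) / sigma, (query - mu) / sigma)

print(raw_nn, scaled_nn)
# 1 0
```

On raw inputs the query matches row 1, which has the closest volume but a return of the opposite sign. After standardisation it matches row 0, which is close on both features.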
Worked example: regime analogy
If the five nearest historical neighbours of today's (return, realised volatility) point are all stress days, kNN with $k = 5$ classifies today as stress by a 5-0 vote. This is not a beta, split, or margin; it is local historical analogy.
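The vote itself can be inspected directly. The following sketch uses a small hypothetical history and illustrative values for today; none of the numbers come from the text. It tallies the labels of the five nearest neighbours under Euclidean distance.

```python
import numpy as np

# Hypothetical history: columns are (return, realised volatility).
# Labels: 0 = calm, 1 = stress.
X_hist = np.array([
    [0.1, 0.7], [0.3, 0.9], [-0.1, 0.8], [0.0, 0.6],
    [-0.9, 1.9], [-1.2, 1.7], [-0.7, 1.6], [-1.0, 2.0], [-1.1, 1.8],
])
y_hist = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
today = np.array([-0.9, 1.8])  # illustrative values, not from the text

k = 5
distances = np.linalg.norm(X_hist - today, axis=1)
neighbors = np.argsort(distances)[:k]
# votes[0] counts calm neighbours, votes[1] counts stress neighbours.
votes = np.bincount(y_hist[neighbors], minlength=2)

print(sorted(neighbors.tolist()), votes.tolist())
# [4, 5, 6, 7, 8] [0, 5]
```

Printing the neighbour indices alongside the tally is what makes the prediction auditable: each index points to a specific historical day that can be examined.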
Common confusions and pitfalls
Where this goes next
- Decision Tree: learns local regions through splits instead of distances.
- Support Vector Machine (SVM): builds a boundary from support vectors.
- Cross-Validation: selects $k$ and preprocessing choices.
- Random Forest: handles nonlinear tabular structure through tree averaging.
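As a preview of the cross-validation link, here is a minimal sketch of choosing $k$ by leave-one-out accuracy on synthetic two-regime data. The helper names, the candidate grid, and the data are assumptions made for illustration. Shuffled leave-one-out ignores time ordering and is used here only for brevity; for market data a time-ordered split is usually the safer choice.

```python
import numpy as np


def knn_predict_one(X_train, y_train, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    neighbors = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    return labels[np.argmax(counts)]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i  # drop observation i from the training set
        hits += int(knn_predict_one(X[mask], y[mask], X[i], k) == y[i])
    return hits / len(X)


# Synthetic two-regime data (hypothetical), 30 points per class.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.8], 0.3, size=(30, 2)),
    rng.normal([-1.0, 1.8], 0.3, size=(30, 2)),
])
y = np.array([0] * 30 + [1] * 30)

# Odd candidate values avoid tied votes in a two-class problem.
scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

When several values of $k$ tie on accuracy, `max` returns the first one in the grid, so the grid order acts as an implicit preference for smaller $k$.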
References
- Aurélien Géron (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly. Ch. 1 (instance-based learning) and Ch. 3 (classification workflows).
- Broad ML textbook extract, resources/ml/Copy of 11. Machine Learning.txt. Ch. 3 (K Nearest Neighbours), consulted as secondary OCR-backed support.