FIG. 13 · ATLAS
K-Nearest Neighbors
No training.
Just memory.
There is no model. To predict, find the k closest training points and vote. Trivially simple, often surprisingly competitive — but every prediction scans the dataset, which is the catch.
Click anywhere to add a training point of the active class. The background shading is the predicted class for every region of the canvas. Move the k slider to feel the bias-variance trade-off — k=1 is jagged and overfit, k=15 is smooth and biased.
§ I · The boundary is the data
Every grid cell shows the prediction. The decision boundary is wherever neighbor votes flip.
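One way to see this concretely: shade a grid of query points with whatever a KNN classifier predicts there. A minimal sketch using scikit-learn — the toy data, grid resolution, and library choice are illustrative, not what the demo itself runs:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: 2-D points with binary labels (stand-ins for the clicked points).
X = np.array([[0.2, 0.3], [0.4, 0.8], [0.7, 0.2], [0.8, 0.9], [0.5, 0.5]])
y = np.array([0, 0, 1, 1, 0])

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Predict the class at every cell of a coarse grid; the class changes exactly
# where the neighbor vote flips -- that is the decision boundary.
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.column_stack([xx.ravel(), yy.ravel()])
background = clf.predict(grid).reshape(xx.shape)
```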
§ II · How it works
KNN has no training step. The model is the training set. Predict by computing the distance from the new point to every training point, sorting, taking the top k, and majority-voting their labels. Done.
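That whole recipe fits in a few lines. A brute-force sketch in NumPy — the function name and defaults are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote over its k nearest neighbors."""
    # Distance from the query to every training point -- the full scan.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```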
The "no training" claim is real but lopsided. You skip the training cost and pay it back at prediction time, on every single query. Brute-force KNN scans the entire dataset for each prediction; production KNN uses spatial indexes (KD-trees, ball trees, locality-sensitive hashing) to cut that cost.
The math
For a query point x and training set {(x_i, y_i)}:
N_k(x) = argmin_(I ⊂ {1..n}, |I|=k) Σ_(i ∈ I) ‖x − x_i‖²

The prediction is then the mode (or mean, for regression) of {y_i : i ∈ N_k(x)}.
Distance metric matters. Euclidean is the default; Manhattan, cosine, and Mahalanobis cover other geometries. Feature scaling matters more — KNN treats all dimensions equally, so a feature in [0, 100000] dominates one in [0, 1].
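A toy illustration of why scaling matters — the numbers are made up, but the effect is generic:

```python
import numpy as np

# Feature 0 is income in dollars (0..100000), feature 1 is a rate in [0, 1].
a = np.array([50_000.0, 0.10])
b = np.array([50_100.0, 0.90])   # nearly identical income, very different rate
c = np.array([45_000.0, 0.11])   # quite different income, nearly identical rate

print(np.linalg.norm(a - b))     # ~100.0  -> looks "close"
print(np.linalg.norm(a - c))     # ~5000.0 -> looks "far"

# After standardizing each feature (subtract the mean, divide by the standard
# deviation, both computed on the training set), the two dimensions contribute
# on the same scale and the rate difference is no longer drowned out by dollars.
```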
§ III · Where it shines, where it breaks
Tiny datasets
With fewer than a few thousand points, KNN's "no model" is a feature: there is no architecture to misspecify and overfit. The data IS the model.
Local structure
When class membership depends on local neighborhoods rather than global linear or polynomial trends, KNN naturally captures that. Image and signal retrieval lean on it for this reason.
High-dimensional spaces
The curse of dimensionality is brutal here. In 100 dimensions, every point is roughly equidistant from every other, and "nearest" loses meaning. PCA or feature selection first.
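The concentration effect is easy to reproduce with random data — the dimensions and sample count below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1_000, d))
    q = rng.random(d)
    dists = np.linalg.norm(X - q, axis=1)
    # As d grows, the nearest and farthest points end up nearly the same
    # distance from the query, so "nearest" carries less and less information.
    print(d, dists.min() / dists.max())
```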
Inference at high QPS
Try the sparse preset and crank k to 15. Each prediction sorts the entire dataset. For a million-point production index, you cannot afford that. KD-trees help up to ~30 dimensions, then degrade.
§ IV · Trade-off scorecard
- Inference: 0.30
- Accuracy: 0.65
- Training: 1.00
- Small size: 0.20
§ V · In production
Spotify's "More like this" similarity search. Approximate KNN over learned embeddings — not raw audio — powers the "fans also like" rails on every artist page. The embeddings are learned by neural networks; the retrieval is brute KNN with HNSW indexing, because nothing else matches its semantic flexibility at that scale.