"ML Foundations (2/9) — Nearest Neighbors: The Simplicity and Pitfalls of k-NN"
Walk into machine learning with one idea: the answer for a new input is the average answer of its closest friends. That single sentence is the entire \(k\)-Nearest Neighbors (k-NN) algorithm. Precisely because it is so simple, it exposes the core machine learning concepts (bias-variance, the curse of dimensionality, non-parametric models, decision boundaries) more honestly than any fancier model.
0. Learning Objectives
- State the \(k\)-NN classification and regression rules in one formula and one sentence each.
- Define Euclidean, Manhattan, and cosine distances, and choose between them given the data.
- Explain how small vs. large \(k\) changes bias and variance, with both a picture and a formula.
- State the curse of dimensionality in one line and explain why it breaks distance-based methods.
- Train k-NN with sklearn and tune
kand the metric using grid search. - Name the time-complexity differences between KD-tree, Ball-tree, and brute-force backends.
1. 핵심 요약
- \(k\)-NN has no training step. It stores the data; at prediction time it looks up the \(k\) closest points and aggregates their labels.
- Classification → majority vote; regression → mean of the neighbors.
- Despite its simplicity, k-NN exposes almost every ML pitfall: distance choice, scale sensitivity, the curse of dimensionality, the bias-variance tradeoff.
- Backends in sklearn: KD-tree (low dimension), Ball-tree (mid dimension), brute force (high dimension or small n). Selection is automatic.
- The classic Cover & Hart (1967) bound: 1-NN's error is at most twice that of the Bayes-optimal classifier — astonishingly good for such a simple method.
2. Intuition — Get the Distance Right and the Answer Follows
2.1 The Algorithm in One Line
A new input \(x\) arrives. "Find the most similar past cases and aggregate what they answered." That is it.
For classification, the \(k\) neighbors vote; the majority class wins.
For regression, the \(k\) neighbors are averaged.
Training: store the data. Nothing else.
Prediction: compute all distances, pick the k smallest, aggregate.
This "use the data itself as the model" trait makes k-NN a non-parametric method — its effective number of parameters grows with the dataset.
2.2 Why Start Here
Three reasons.
- Baseline. Whatever ML problem you face, k-NN is the first comparison point. A complicated model that fails to beat k-NN deserves suspicion.
- Concept density. Distance, scale, dimensionality, bias, variance — all collide inside this one method.
- It is visual. Draw decision boundaries in 2D and the effect of \(k\) is immediate.
2.3 History
- Fix, E. & Hodges, J. L. Discriminatory Analysis – Nonparametric Discrimination: Consistency Properties (1951). USAF report — the statistical foundation.
- Cover, T. M. & Hart, P. E. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1), 21–27 (1967). The most-cited theorem: 1-NN's error rate is at most twice the Bayes-optimal error.
What that 1967 result shows is how close a startlingly simple algorithm gets to the theoretical lower bound.
3. Definitions & Notation
3.1 Distance
For two vectors \(x, x' \in \mathbb{R}^{d}\):
$$ d_{\text{Euclidean}}(x, x') = \sqrt{\sum_{j=1}^{d} (x_j - x'_j)^2} $$
$$ d_{\text{Manhattan}}(x, x') = \sum_{j=1}^{d} |x_j - x'_j| $$
$$ d_{\text{Minkowski}, p}(x, x') = \Big(\sum_{j=1}^{d} |x_j - x'_j|^{p}\Big)^{1/p} $$
\(p=1\) gives Manhattan, \(p=2\) gives Euclidean, \(p \to \infty\) gives Chebyshev (\(\max_j |x_j - x'_j|\)).
When direction matters more than magnitude, use the cosine distance:
$$ d_{\cos}(x, x') = 1 - \frac{x \cdot x'}{\|x\| \, \|x'\|} $$
Cosine is the standard in text embeddings and most LLM representation spaces.
3.2 k-NN Classification
Let \(\mathcal{N}_k(x) = \{x_{(1)}, \dots, x_{(k)}\}\) be the \(k\) training points nearest to \(x\). The classification prediction is
$$ \hat{y}(x) = \arg\max_{c \in \{1,\dots,C\}} \sum_{i \in \mathcal{N}_k(x)} \mathbb{1}[y_i = c] $$
Weighted k-NN uses inverse distance:
$$ \hat{y}(x) = \arg\max_{c} \sum_{i \in \mathcal{N}_k(x)} \frac{1}{d(x, x_i)} \, \mathbb{1}[y_i = c] $$
3.3 k-NN Regression
$$ \hat{y}(x) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y_i $$
or a distance-weighted average
$$ \hat{y}(x) = \frac{\sum_{i \in \mathcal{N}_k(x)} w_i \, y_i}{\sum_{i \in \mathcal{N}_k(x)} w_i}, \quad w_i = \frac{1}{d(x, x_i) + \epsilon} $$
4. Math & Mechanism — Bias, Variance, Dimensionality
4.1 \(k\) and the Bias-Variance Tradeoff
\(k=1\): the model fits the training set perfectly. Training error = 0. But a tiny perturbation in the new input drags the prediction along → high variance, low bias.
\(k=N\) (all the data): the model always answers the majority class. High bias, low variance.
\(k\) interpolates between the two extremes. For regression, averaging \(k\) neighbors shrinks the variance roughly as \(1/k\) (under an independence assumption):
$$ \mathrm{Var}\Big(\frac{1}{k}\sum_{i=1}^{k} y_i\Big) = \frac{\sigma^2}{k} $$
Larger \(k\) drops variance but raises bias because increasingly distant neighbors get pulled in. The optimal \(k\) depends on the data's noise level and the curvature of the decision boundary.
4.2 The Shape of the Decision Boundary
k-NN lives on a Voronoi partition. Each training point owns a cell, and points inside that cell inherit its label (for 1-NN).
. . .
+ | | + | ← each cell belongs to one training point
. . .
- | + | + |
. . .
Increase \(k\) and the boundary smooths out. At \(k=1\) it is jagged; at \(k=N\) it disappears entirely (the whole space becomes the majority class).
4.3 The Curse of Dimensionality
In high-dimensional space, all points become roughly equidistant.
Sprinkle \(n\) random points uniformly inside \([0,1]^d\) and measure distances from the origin. As \(d\) grows,
$$ \frac{\text{max distance} - \text{min distance}}{\text{min distance}} \;\xrightarrow{d \to \infty}\; 0 $$
Every neighbor is equally far away. The notion of "nearest" collapses.
A more visceral version: to fill 90% of the volume of a \(d\)-cube, the side length must be \(0.9^{1/d}\). For \(d=10\) it is 0.99; for \(d=100\) it is 0.999. Almost all the volume sits near the surface in high dimensions, which destroys distance-based reasoning.
Mitigations:
- Dimensionality reduction: PCA, autoencoders.
- Feature selection: keep the meaningful axes.
- Metric learning: Mahalanobis distance, large-margin nearest neighbor.
- Use a learned embedding space: text embeddings are commonly \(d=768\), yet cosine distance there is still meaningful because the embedding itself absorbs the curse.
4.4 Scale Sensitivity
If features have different units (height in cm, weight in kg), the larger-valued feature dominates the distance — a 1 cm difference in height shouldn't outweigh a 1 kg difference in weight, but it will if you forget to scale.
Fix: StandardScaler (zero mean, unit variance). The Pipeline pattern from Part 1 applies directly.
5. Diagram — What \(k\) Changes
A schematic view:
k=1: + - + - + boundary: jagged
↕
k=5: + - + boundary: smooth
↕
k=N: everyone → majority class no boundary
6. Principle Walkthrough — Unpacking the k-NN One-Liner
The algorithm itself is two lines — "compute distances, take the majority vote of the \(k\) nearest." The interesting part is that those two lines define a non-learning model, and that fact carries deeper meaning than the code itself.
6.1 The Standard Call
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_tr, y_tr)
There is almost no learning step in this line. fit just stores the training data in an index — no parameters get optimized. The cost is deferred to inference, where every query needs \(O(nd)\) distance computations. This is the very definition of a non-parametric model, and the reason k-NN sits at the start of every ML curriculum.
6.2 The Distance Metric Is a Domain Decision
What you compare Minkowski distance \(d_p(x, x') = \big(\sum_j |x_j - x'_j|^p\big)^{1/p}\) gives Euclidean for \(p=2\), Manhattan for \(p=1\), Chebyshev for \(p \to \infty\). The parameter \(p\) defines the shape of the space.
Why each survives - Euclidean (\(p=2\)): isotropic, rotation-invariant, differentiable — the default for neural-net embeddings. - Manhattan (\(p=1\)): natural on grids; more robust to outliers; city-block / chessboard distances. - Cosine: compares directions, not magnitudes — the de facto standard for text embeddings, where word-frequency magnitude matters less than direction.
Choosing a metric is answering a domain question: what does it mean for two points to be similar here? Section 7.3 generalizes this to learning the metric.
6.3 The Curse of Dimensionality — When Distance Stops Meaning Anything
Observed pattern Sample 1,000 points uniformly from a \(d\)-dimensional unit cube; for a random query, the ratio \((\max - \min)/\min\) between the farthest and nearest distance behaves as:
| \(d\) | 2 | 5 | 10 | 50 | 100 | 500 | 1000 |
|---|---|---|---|---|---|---|---|
| Ratio | 14.2 | 4.7 | 2.6 | 0.99 | 0.68 | 0.29 | 0.21 |
What it means At \(d = 1000\), the gap between farthest and nearest is only 20% of the nearest distance — "near" has essentially lost its content. The k-NN majority vote converges to random selection (Beyer et al. 1999, When Is "Nearest Neighbor" Meaningful?).
Forward link Part 6's representation learning is the direct answer. Even if raw inputs sit in 1,000 dimensions, a learned embedding compresses them into a meaningful 64–512-dimensional space — and distance becomes meaningful again. Parts 6 and 8 (embeddings, attention) do exactly this.
6.4 Decision Boundary Visualization (concept)
What you are looking at Scatter the training points on a 2D plane, build a fine grid that covers the plane, run k-NN at every grid point, and color the plane by predicted class. The lines where the color changes are the decision boundary — a visible portrait of how the model carves classes apart.
What \(k\) changes - \(k=1\): the boundary is jagged. Each training point gets its Voronoi cell handed straight to the boundary. A single outlier can flip neighboring cells. High variance. - \(k=10\): the boundary smooths out. The majority vote dampens any single outlier. Higher bias, lower variance. - \(k=n\) (the whole training set): the most frequent class always wins. The boundary disappears; the model has "forgotten" the data and collapsed to the prior.
Why this visual is so useful pedagogically
The bias–variance trade-off is abstract on paper. Watching the boundary morph as \(k\) moves from 1 → 5 → 20 → 100 turns it into something concrete — and the same vocabulary then carries cleanly into every classifier you meet later: the straight line of logistic regression, the staircase of decision trees, the margin of an SVM, the curved isocontours of a neural net. The sklearn gallery example plot_classification.html is the canonical meshgrid + contourf template.
Limits and what comes next You can't directly visualize a boundary in three or more dimensions. Two workarounds: (1) project to 2D with PCA, t-SNE, or UMAP and approximate the boundary on that projection; (2) fix all but two variables and slice. The next post (linear regression) trades the boundary for a regression line — the categorical \(y\) becomes continuous, but the underlying object — the "prediction surface" the model draws over input space — is the same idea.
7. Variants & Case Studies — Why k-NN Is the Root of Embedding Search
A simple algorithm survived to 2026 because nearest-neighbor lookup is itself the core operation behind recommendation, search, and RAG.
7.1 Distance-Weighted k-NN
What changes
Instead of treating all neighbors equally, weight by inverse distance (weights="distance"), giving \(w_i = 1/(d_i + \epsilon)\) so that far-away neighbors lose their say.
Why it appeared With fixed \(k\), the \(k\)-th neighbor can sit at an essentially unrelated distance. Letting that point vote equally roughens the boundary. Dudani 1976, The Distance-Weighted k-NN Rule, established weighting as a standard refinement.
What it enabled Smoother decision boundaries; natural tie-breaking. In the regression version it matters more — the predicted value becomes a distance-weighted average, producing continuous outputs.
Limits and next step The weighting function is still fixed. Learning the weights gives Section 7.3's metric learning, and the most general form is the attention mechanism in neural nets.
7.2 Radius Neighbors
What changes
Drop the fixed \(k\); use all neighbors within radius \(r\) (RadiusNeighborsClassifier). The number of neighbors is now data-dependent.
Why it appeared With variable density, fixed \(k\) misbehaves: in dense regions \(k=5\) is plenty; in sparse regions even \(k=5\) reaches points so far away their contribution is noise.
What it enabled
- GIS — natural queries like "every store within 1 km."
- LOF (Local Outlier Factor) — outlier detection by measuring local density via radius.
- DBSCAN clustering — a core point is defined as having at least minPts neighbors within \(\epsilon\).
Limits and next step Choosing \(r\) is as hard as choosing \(k\). And in extremely sparse regions, neighbor count drops to zero — no prediction.
7.3 Metric Learning — Learn the Distance
What changes Replace Euclidean with a Mahalanobis distance \(d(x, x') = \sqrt{(x - x')^{\top} M (x - x')}\), and learn \(M\) so that same-class points are close and different-class points are far.
Why it appeared Euclidean treats every axis equally, but in real data the class-relevant axes are not all the same. Weinberger & Saul 2009, Distance Metric Learning for Large Margin Nearest Neighbor Classification (LMNN), is the canonical reference.
What it enabled - Face recognition embeddings — FaceNet (Schroff et al. 2015) learns "same person near, different person far." - Semantic search — Sentence-BERT (Reimers & Gurevych 2019) learns a sentence-similarity distance. - Recommendation — user and item embeddings trained so that "similar taste" is "close."
Limits and next step A single \(M\) defines a linear transformation. Nonlinear transformations are exactly the role of neural-network embeddings in Part 6 — and querying those embeddings with k-NN is the body of every RAG system.
7.4 Where k-NN Survives — Modern Neighbor Search
| Domain | Use | Standard tool |
|---|---|---|
| Recommendation | User-user / item-item neighborhoods | Collaborative filtering |
| Anomaly detection | LOF — local density | sklearn LocalOutlierFactor |
| Image search | Embed, then ANN | FAISS, ScaNN |
| Medical diagnosis | Case-based matching | EMR + clinical embeddings |
| RAG | Semantic search over embeddings | OpenAI text-embedding-3, Cohere rerankers |
| Plagiarism detection | Sentence embeddings + cosine similarity | Semantic Scholar API |
The framing: the past decade of ML progress can be summed up as "where are we doing k-NN?" The shift from raw pixels to learned embeddings is the entire story — the algorithm stays nearly identical.
7.5 Scaling — From Brute Force to ANN
Computing \(O(nd)\) for a million vectors per query is not viable. Approximate Nearest Neighbor (ANN) algorithms solve this. - KD-tree (low \(d\), \(d < 20\)): recursive space splits. - Ball-tree: bounds distance computations. - LSH (Locality-Sensitive Hashing): similar points hash to the same bucket. - HNSW (Hierarchical Navigable Small World, Malkov 2018): graph-based; the current default. - IVF + PQ (FAISS): clustering plus quantization — fast and memory-efficient.
ANN delivers 99.5% recall of true k-NN at 100× the speed. This is how RAG systems hit millisecond response times.
8. Limits & Failure Modes — k-NN's Limits Drew the Map of the Next 100 Years
8.1 \(O(nd)\) Inference — Free Training, Expensive Predictions
Why it is essential k-NN does nothing at training time. All computation is deferred to inference. With a million training points, every query is a million distance computations.
How you spot it Measure latency in a real-time service. Per-query > 100 ms is impractical for brute force.
Next step ANN (HNSW, FAISS, ScaNN) for 100× faster lookup at 99.5% accuracy. Or switch to parametric models (linear regression, neural nets) — pay compute at training time, inference at \(O(d)\).
8.2 Memory — You Must Carry All the Training Data
Why it is essential Parametric models can discard the data after training; only the weights matter. k-NN cannot — the training set is the model. One million samples × 1,000 features × 4 bytes ≈ 4 GB in RAM.
How you spot it Compare your deployment RAM against the dataset size.
Next step Prototype selection (NCA, condensed k-NN), quantization (FAISS Product Quantization), or dimensionality reduction before k-NN.
8.3 Scale Sensitivity — Meters and Millimeters Get the Same Weight
Why it is essential If one feature has a unit 1,000× larger than another, that feature dominates the distance. Heights (cm) and weights (kg) together → cm overwhelms kg, and the model effectively ignores weight.
How you spot it Check per-feature std. A 10× gap in scale is enough to demand normalization.
Next step
StandardScaler (z-score) or RobustScaler (median + IQR) inside a Pipeline — always. Section 7.3's metric learning generalizes this.
8.4 Missing Values — Distance Itself Breaks
Why it is essential
A single np.nan in any feature turns \((x_j - x'_j)^2\) into NaN and the full distance is undefined. Distance-based algorithms have no graceful fallback.
How you spot it
X.isna().any() — even one NaN means the algorithm cannot run.
Next step Imputation upfront (mean, median, or k-NN imputation). Or partial distance — compute over the non-missing dimensions and rescale.
8.5 Class Imbalance — The Majority Crowds Out the Minority
Why it is essential With 1% positive rate and \(k=5\), the expected number of positives among 5 neighbors is 0.05 — almost always 0. The majority vote always predicts negative; the model never sees positives.
How you spot it classification report's positive-class recall near 0.
Next step Weighted k-NN, oversampling (SMOTE), or undersampling the majority class. The same pattern as Section 7.1 of Part 1.
8.6 Curse of Dimensionality — Distance Loses Meaning
Why it is essential As shown in the Section 6.3 table, at \(d \to \infty\) all points sit at near-identical distances. Majority vote degenerates into random selection.
How you spot it \(d > 50\) on roughly-uniform data.
Next step Representation learning — PCA, autoencoders, neural-net embeddings. Compress 1,000-dim pixels to a meaningful 128-dim embedding. The driving force behind Parts 6 and 8.
What the Limits Sketch
k-NN's six limits compress the next century of ML: - \(O(nd)\) inference → parametric models (Part 3 linear, Part 6 neural) - Memory → weight compression and quantization (Part 9 local LLM) - Scale sensitivity → normalization, BatchNorm (Part 5 regularization, Part 7 training) - Missing values → imputation models - Class imbalance → cost-sensitive learning (Part 1 §7.1) - Curse of dimensionality → representation learning (Part 6 representations, Part 8 embeddings)
Understanding k-NN means seeing the motivation behind every subsequent ML advance compressed into one model.
9. Quick Recap — Answer Before You Peek
Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.
Q1. What does k-NN's fit actually do, and why is it called a non-parametric model?
Answer fit just stores the training set in an index; no parameters are learned. Distances are computed at inference time.
Why No learned weights — the data itself is the model. Cost: \(O(nd)\) per query, plus storing the whole dataset in memory. (Sections 6.1, 8.1, 8.2.)
Q2. As \(k\) grows from 1 to \(n\), what happens to bias and variance?
Answer Small \(k\) → high variance, low bias. Large \(k\) → high bias, low variance. \(k = n\) → always the majority class (maximum bias). Why Averaging more neighbors smooths noise (variance down) but flattens the boundary (bias up). This is the bias-variance decomposition (Part 1, §1.3) made visual. (Section 6.4.)
Q3. Why does k-NN fail in 1,000-dimensional spaces?
Answer The curse of dimensionality — nearest and farthest distances become nearly equal, so "near" loses meaning. Why The \((\max - \min)/\min\) ratio collapses from 14 at \(d=2\) to 0.2 at \(d=1000\) (Beyer et al. 1999). The answer is representation learning — compress to a meaningful 64–512-dim embedding. (Sections 6.3, 8.6.)
Q4. Why is RAG (Retrieval-Augmented Generation) essentially a variant of k-NN?
Answer RAG retrieval is "find the \(k\) document embeddings closest to the query embedding" — k-NN over an embedding space. Why The progress in ML is not in the algorithm but in where k-NN is run: from raw token counts to learned embeddings. At scale, ANN (FAISS, HNSW) accelerates it 100×. (Sections 7.4, 7.5.)
Q5. What happens if you skip StandardScaler and run k-NN on height (cm) and weight (kg)?
Answer The cm values (around 170) dwarf the kg values (around 65); distance is decided almost entirely by height. The model effectively ignores weight. Why Distance is isotropic, so unit gaps translate directly into weight gaps. Normalization is not optional — it is mandatory. Metric learning generalizes the same fix. (Sections 6.2, 8.3.)
If four or five came out as one-liners, k-NN is solidly in place. For any question that stalled, return to the section in parentheses.
10. Further Reading
Primary sources
- Cover, T. M., Hart, P. E. Nearest Neighbor Pattern Classification. IEEE TIT 13(1), 21–27 (1967). — The 1-NN error bound.
- Fix, E., Hodges, J. L. Discriminatory Analysis – Nonparametric Discrimination: Consistency Properties (1951). USAF School of Aviation Medicine, Report Number 4. — The statistical foundation.
- Weinberger, K. Q., Saul, L. K. Distance Metric Learning for Large Margin Nearest Neighbor Classification. JMLR 10 (2009): 207–244. — Learned metrics.
Official docs
- sklearn Nearest Neighbors:
https://scikit-learn.org/stable/modules/neighbors.html - KD-tree / Ball-tree internals:
https://scikit-learn.org/stable/modules/neighbors.html#brute-force
Companion books
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Sections 2.3 and Chapter 13.
- Aggarwal, C. C. Outlier Analysis. 2nd ed., Springer 2017. LOF and related k-NN anomaly methods.
In Part 3 we'll meet linear regression — the oldest method for fitting a line (or a hyperplane) to data. Least squares, the normal equations, and gradient descent all meet in one place, and the punchline of all of machine learning ("everything is, in the end, minimizing a loss") becomes visible in a single formula.
Series overview: Series index
댓글
댓글 쓰기