k-Nearest Neighbors (kNN)

1. What Is k-Nearest Neighbors?

k-Nearest Neighbors (kNN) is a supervised machine learning algorithm used for both Classification and regression. It is one of the oldest and simplest algorithms in machine learning — proposed by Fix and Hodges in 1951 — and yet remains surprisingly competitive on many real-world problems.

The core idea is radical in its simplicity: to predict the label of a new point, look at the k most similar training points and let them vote.

No model is fit. No parameters are learned. No equations are optimized. The entire training dataset is the model.

This makes kNN a non-parametric, instance-based, lazy learning algorithm — three terms that each carry deep meaning, explored in detail throughout this document.

2. The Core Intuition

The kNN algorithm is grounded in one of the most fundamental assumptions in all of statistics:

The Continuity Assumption: Points that are close together in feature space are likely to have similar labels.

This is sometimes called the manifold assumption or the smoothness prior. It underpins not just kNN, but also kernel methods, Gaussian processes, and much of interpolation theory.

A concrete analogy: You move to a new city and want to know if your new neighborhood is safe. You don't build a statistical model of the whole city. You simply ask your 5 nearest neighbors. If 4 out of 5 say "yes, it's safe," you conclude it probably is. That's kNN.

The beauty and the danger of kNN are both rooted in this assumption. When it holds — when the feature space is meaningful and dense — kNN works remarkably well. When it breaks down — in high dimensions, with sparse data, or irrelevant features — kNN fails completely.

3. Mathematical Foundation

3.1 Distance Metrics

The entire algorithm depends on a single operation: measuring how far apart two points are. The choice of distance metric is the most consequential decision in kNN — different metrics encode completely different notions of "similarity."

Euclidean Distance (L2)

d(xᵢ, xⱼ) = √( Σₚ (xᵢₚ − xⱼₚ)² )

The straight-line distance in p-dimensional space. The default and most intuitive metric.

When to use: Features are continuous, on the same scale, and Euclidean geometry is meaningful (e.g., physical coordinates, normalized sensor readings).

Weakness: Dominated by features with large ranges. Assumes all dimensions are equally important. Very sensitive to scale.

Manhattan Distance (L1)

d(xᵢ, xⱼ) = Σₚ |xᵢₚ − xⱼₚ|

The sum of absolute differences — the distance a taxi cab travels on a grid. Also called City Block distance.

When to use: Features have meaningful absolute differences. More robust to outliers than L2 (doesn't square large deviations). Often preferred in high dimensions.

Insight: In very high dimensions, Manhattan distance is often more discriminative than Euclidean — L2's squaring collapses large dimensional spaces into a narrow range of values.

Minkowski Distance (Lₚ)

The generalization of both L1 and L2:

d(xᵢ, xⱼ) = ( Σₚ |xᵢₚ − xⱼₚ|^q )^(1/q)

q value	Distance metric
q = 1	Manhattan (L1)
q = 2	Euclidean (L2)
q → ∞	Chebyshev (max absolute difference)

Chebyshev Distance (L∞)

d(xᵢ, xⱼ) = max_p |xᵢₚ − xⱼₚ|

Only the largest difference across any single dimension matters. Used in chess (king moves) and warehouse routing.

Cosine Similarity / Distance

similarity(xᵢ, xⱼ) = (xᵢ · xⱼ) / (||xᵢ|| · ||xⱼ||)

cosine_distance = 1 − cosine_similarity

Measures the angle between two vectors, ignoring magnitude. Particularly powerful for text and high-dimensional sparse data (tf-idf vectors, document embeddings).

When to use: When the direction of the feature vector matters more than its magnitude (e.g., two documents using the same words with different frequencies are "similar").

Hamming Distance

d(xᵢ, xⱼ) = (1/p) · Σₚ 𝟙[xᵢₚ ≠ xⱼₚ]

The fraction of positions where two binary (or categorical) vectors differ.

When to use: Categorical features, binary strings, DNA sequences.

Mahalanobis Distance

d(xᵢ, xⱼ) = √( (xᵢ−xⱼ)ᵀ S⁻¹ (xᵢ−xⱼ) )

Where S is the covariance matrix of the data. This accounts for correlations between features and different feature scales simultaneously — essentially Euclidean distance in a whitened (decorrelated, normalized) space.

When to use: Features are correlated and on different scales. Equivalent to first applying PCA whitening, then using Euclidean distance.

Caveat: Requires computing and inverting the covariance matrix — expensive and unstable in high dimensions or with small datasets.

Distance Metric Summary

Metric	Best For	Scale Sensitive	Handles Correlation
Euclidean	Continuous, normalized features	✅ Yes	❌ No
Manhattan	High-dim, robust to outliers	✅ Yes	❌ No
Chebyshev	Uniform importance of worst feature	✅ Yes	❌ No
Cosine	Text, sparse, high-dim vectors	❌ No (angle)	❌ No
Hamming	Categorical / binary features	N/A	❌ No
Mahalanobis	Correlated, multi-scale features	❌ No	✅ Yes

3.2 Choosing k Neighbors

Once distances are computed, sort all training points by distance and take the k closest ones. These are the k-nearest neighbors of the query point.

Formally, for a query point x:

N_k(x) = {k training points xᵢ with the smallest d(x, xᵢ)}

The value of k is the most critical hyperparameter in kNN (see Section 10 for full bias-variance analysis).

3.3 The Voting Mechanism

For Classification, the predicted label is the majority class among the k neighbors:

ŷ = argmax_c  |{xᵢ ∈ N_k(x) : yᵢ = c}|

"Which class has the most votes among my k neighbors?"

Probability estimate:

P̂(y = c | x) = |{xᵢ ∈ N_k(x) : yᵢ = c}| / k

This is a simple class fraction — the proportion of neighbors belonging to class c.

Tie-breaking: With even k and two classes, ties are possible. Strategies:

Reduce k by 1 (use k−1)
Use distance-weighted voting (closer neighbor breaks the tie)
Randomly break ties (adds stochasticity)

3.4 Weighted kNN

Standard kNN treats all k neighbors as equally important. But a neighbor at distance 0.01 should intuitively matter more than one at distance 10.0.

Distance-weighted voting:

ŷ = argmax_c  Σᵢ∈N_k(x)  wᵢ · 𝟙[yᵢ = c]

Where the weight of each neighbor is:

wᵢ = 1 / d(x, xᵢ)²        (inverse distance squared)

Or more generally, any decreasing function of distance. The Gaussian kernel is a smooth alternative:

wᵢ = exp(−d(x, xᵢ)² / 2σ²)

Benefits of weighting:

Resolves ties naturally
More robust to outliers in the neighbor set
Closer points exert stronger influence

Edge case: If d(x, xᵢ) = 0 for some training point (query equals a training point exactly), set wᵢ = ∞ and predict that point's label directly.

In sklearn:

KNeighborsClassifier(weights='uniform')    # Equal weights
KNeighborsClassifier(weights='distance')   # Inverse distance weights

4. How It Works — Step by Step

Given: A training dataset of (feature vector, label) pairs. A new query point x.

Step 1 — Compute distances:

Compute d(x, x₁), d(x, x₂), ..., d(x, xₘ)  for all m training points

Step 2 — Sort and select:

Sort training points by distance to x (ascending)
Select the top k → N_k(x) = {x₍₁₎, x₍₂₎, ..., x₍ₖ₎}

Step 3 — Vote:

Count class frequencies among N_k(x)
ŷ = majority class

Step 4 — Optional: compute probability:

P̂(y = c | x) = (count of class c in N_k(x)) / k

Worked Example:

Training data (2D):
  Point A: (1.0, 2.0) → Class 0
  Point B: (1.5, 1.8) → Class 0
  Point C: (3.0, 4.0) → Class 1
  Point D: (2.8, 3.9) → Class 1
  Point E: (3.2, 3.7) → Class 1

Query point x = (2.0, 3.0), k = 3

Euclidean distances:
  d(x, A) = √((2−1)² + (3−2)²)   = √2    ≈ 1.414
  d(x, B) = √((2−1.5)² + (3−1.8)²) = √1.69 ≈ 1.300
  d(x, C) = √((2−3)² + (3−4)²)   = √2    ≈ 1.414
  d(x, D) = √((2−2.8)² + (3−3.9)²) = √1.45 ≈ 1.204
  d(x, E) = √((2−3.2)² + (3−3.7)²) = √1.93 ≈ 1.389

Sorted: D(1.204), B(1.300), E(1.389), A(1.414), C(1.414)

k=3 nearest: D (Class 1), B (Class 0), E (Class 1)

Vote: Class 0 = 1 vote, Class 1 = 2 votes
Prediction: Class 1
Probability: P(Class 1) = 2/3 ≈ 0.667

5. How kNN "Trains"

5.1 Lazy Learning Explained

kNN has no training phase in the traditional sense. The algorithm does not:

Learn weights or parameters
Build an internal model
Fit a function to the data
Compute gradients or solve an optimization problem

"Training" kNN means exactly one thing: storing the training dataset in memory.

Training time:   O(1)   — just store the data
Prediction time: O(m·p) — compute distances to all m training points across p features
Memory:          O(m·p) — store the entire dataset

This inverts the usual machine learning tradeoff: all the computational cost is deferred to prediction time, not training time. This is the defining characteristic of lazy learning.

5.2 Eager vs. Lazy Learners

Property	Eager Learners (LR, SVM, DT)	Lazy Learners (kNN)
Training phase	Expensive — fit model to all data	Free — just store data
Prediction phase	Cheap — evaluate a fixed function	Expensive — search all training data
Model size	Compact — just parameters	Grows with training data
Adapt to new data	Requires retraining	Just add the new point
Global model	Yes — one function covers all space	No — purely local reasoning
Target function	One global approximation	Infinite local approximations

Key insight: Lazy learners implicitly build infinitely many local models — one for each unique query point. This gives them extraordinary flexibility but at the cost of computational scalability.

6. The Decision Boundary of kNN

6.1 Voronoi Tessellation

With k=1 (1-Nearest Neighbor), the decision boundary is the Voronoi diagram of the training data.

A Voronoi diagram partitions space into regions — one per training point — where each region contains all points closer to that training point than to any other. The boundaries between these regions are the decision boundaries of 1-NN.

        ●           ●
      ·····       ···
   ·····│·····│·····
  ●     │  ■  │     ●
   ·····│·····│·····
      ···       ···
        ■           ■

Each training point "owns" the space nearest to it. The class of the nearest neighbor determines the class of every point in its Voronoi cell.

For k > 1, the boundary becomes smoother — it's no longer a Voronoi diagram but a blending of neighborhoods.

6.2 Effect of k on the Boundary

k = 1:    Every training point is its own region — highly irregular, jagged boundary
k = 3:    Smoother — small clusters of outliers lose influence
k = 10:   Smoother still — only dominant local patterns survive
k = m:    Every query predicts the global majority class — a horizontal line (max smoothing)

Visually:

k=1 boundary:  /\/\/\~~~\/\/\   (noisy, complex, every training point matters)
k=5 boundary:  /‾‾\___/‾‾\     (smoother, local outliers averaged away)
k=50 boundary: ‾‾‾‾‾‾‾‾‾‾‾‾‾   (nearly constant — predicts majority class everywhere)

The Voronoi Insight: With k=1, kNN has zero training error — every training point predicts itself correctly. But this perfect memorization generalizes poorly. Increasing k trades training Accuracy for generalization.

7. Data Structures for Efficient Search

The naive brute-force search is O(m·p) per query. For m = 1,000,000 training points and p = 100 features, this is 100 million floating-point operations per prediction. Specialized data structures make this tractable.

7.1 Brute Force Search

For each query x:
    Compute d(x, xᵢ) for all i = 1, ..., m
    Sort distances
    Return top k

Complexity	Value
Per query	O(m·p)
Total (n queries)	O(n·m·p)

When to use: Small datasets (m < 1,000), or when dimensionality is very high (where tree-based methods degrade). sklearn auto-selects this for high-p or small-m.

7.2 KD-Trees

A KD-tree (k-dimensional tree) is a binary tree that recursively partitions the feature space by splitting along the axis of greatest variance.

Construction:

1. Choose the feature with the greatest spread (e.g., x₁)
2. Split data at the median of x₁ → left and right subtrees
3. Recurse on each subtree, cycling through features
4. Stop when a leaf has ≤ leaf_size points

Search:

1. Traverse down the tree to the leaf containing query x
2. Check all points in that leaf
3. Backtrack up — if the hypersphere of the current best distance
   crosses a splitting boundary, explore the other branch too

Complexity:

Operation	Average case	Worst case
Build	O(m · p · log m)	O(m · p · log m)
Single query	O(log m)	O(m) (high-dim)
n queries	O((m + n) · log m)	O(n · m)

Limitation: KD-trees become inefficient when p > ~20 (the curse of dimensionality — almost all branches must be explored). sklearn recommends KD-trees only for p < 20.

7.3 Ball Trees

A Ball Tree partitions data into nested hyperspheres (balls) rather than hyperrectangles. More efficient than KD-trees in moderate-to-high dimensions.

Construction:

1. Find the feature with the greatest spread
2. Pick the two most distant points as "pivots"
3. Split data: points closer to pivot 1 go left, closer to pivot 2 go right
4. Recurse — each node is a hypersphere enclosing all its points

Search: The triangle inequality allows entire balls to be pruned if the distance to the ball's center minus its radius is already greater than the current best distance.

Complexity:

Operation	Average case	Worst case
Build	O(m · p · log m)	O(m · p · log m)
Single query	O(p · log m)	O(m · p)

Ball trees are generally preferred over KD-trees for p > 20 and structured (non-random) data.

7.4 Approximate Nearest Neighbors

For very large m and moderate p, exact nearest neighbor search is too slow. Approximate Nearest Neighbor (ANN) algorithms trade a small Accuracy loss for dramatic speed gains:

Algorithm	Method	Speed vs. Exact	Accuracy
FAISS (Meta)	GPU-accelerated, IVF + HNSW	100–1000x	~99%
HNSW	Hierarchical Navigable Small World	100x+	~99%
LSH	Locality Sensitive Hashing	10–100x	~95%
Annoy (Spotify)	Random projection trees	10–100x	~95%
ScaNN (Google)	Structured quantization	100x+	~99%

These are the engines behind modern vector databases (Pinecone, Weaviate, Chroma) and recommendation systems — kNN at billion-point scale.

8. The Curse of Dimensionality

This is kNN's most devastating structural limitation — and one of the most important concepts in all of machine learning.

8.1 What It Is

The curse of dimensionality refers to a collection of phenomena that emerge as the number of dimensions p grows large, making intuitive geometric reasoning break down.

Volume of a hypersphere:

V(r, p) = (π^(p/2) / Γ(p/2 + 1)) · rᵖ

As p → ∞, the ratio of the volume of a hypersphere to its enclosing hypercube goes to zero:

V(sphere) / V(cube) = πᵖ/² / (2ᵖ · Γ(p/2 + 1))  →  0   as p → ∞

This means: in high dimensions, almost all the volume is concentrated in the corners of the cube, not the center.

8.2 Why It Destroys kNN

Problem 1 — Everything is equally distant:

In high dimensions, the distance between the nearest and farthest neighbor converges to the same value:

lim_{p→∞} [ d_max(x) − d_min(x) ] / d_min(x) → 0

When all points are approximately equidistant, "nearest neighbor" becomes meaningless — there is no meaningful notion of proximity. The k-nearest neighbors of any query point are essentially a random sample of the training data.

Numerical example:

For points uniformly distributed in [0,1]^p:

p = 2:   To capture 1% of data, need a cube with side ≈ 0.10  (10% of range)
p = 10:  To capture 1% of data, need a cube with side ≈ 0.63  (63% of range)
p = 100: To capture 1% of data, need a cube with side ≈ 0.955 (96% of range!)

In 100 dimensions, finding your 1% nearest neighbors requires looking almost everywhere.

Problem 2 — Exponential data requirement:

To maintain the same density of training data as dimensionality increases, you need exponentially more data:

To keep 10 neighbors in a ball of radius r:
  p = 2:   need ~100 training points
  p = 10:  need ~10¹⁰ training points
  p = 100: need ~10¹⁰⁰ training points (impossible)

8.3 Mitigations

Strategy	Description
Feature selection	Remove irrelevant features before running kNN
Dimensionality reduction	PCA, t-SNE, UMAP, autoencoders — project to a lower-dimensional space
Feature weighting	Weight features by relevance (e.g., learned metric, mutual information)
Metric learning	Learn a task-specific distance metric (Mahalanobis, LMNN, NCA)
Cosine distance	More robust than Euclidean in high dimensions (for normalized vectors)
Use a different algorithm	In truly high-dimensional spaces, tree-based or linear models often win

Rule of thumb: kNN works best when p ≤ ~20 features (or after dimensionality reduction to that range).

9. The Bias-Variance Tradeoff in kNN

The hyperparameter k directly controls the bias-variance tradeoff — but in the opposite direction from most algorithms:

         Low k                          High k
         ↓                               ↓
    Low Bias                         High Bias
    High Variance                    Low Variance
    Complex boundary                 Smooth boundary
    Memorizes noise                  Misses local structure
         |_____________↑_______________|
                   Optimal k
                   (via CV)

k value	Behavior	Risk
k = 1	Predicts exact training labels — zero training error	Overfitting
k = 3–5	Local patterns, some noise tolerance	Slight overfit
k = 15–30	Regional patterns, smooth boundary	Sweet spot
k = √m	Classical rule of thumb for Starting k	Good default
k = m	Predicts global majority class for every point	Underfitting

Mathematical analysis:

With k=1, the 1-NN classifier has a theoretical error rate bounded by:

ε₁ₙₙ ≤ 2 · ε* · (1 − ε*)

Where ε* is the Bayes error rate (the irreducible minimum). Remarkably, 1-NN is at most twice the Bayes error — a non-trivial theoretical guarantee even for the simplest kNN variant.

For large m and k → ∞ with k/m → 0, kNN converges to the Bayes optimal classifier — the best possible classifier for the data distribution. This is the universal consistency of kNN.

10. Assumptions of the Model

Assumption	Description
Continuity / smoothness	Nearby points in feature space share the same label — the core assumption
Meaningful distance metric	The chosen metric must reflect true similarity for the task
Feature relevance	All features contribute to distance; irrelevant features degrade performance
Feature scale homogeneity	Features must be on the same scale — kNN is not scale-invariant
Sufficient data density	The k neighbors of any test point must be genuinely close — requires dense coverage
IID samples	Training points are drawn independently from the same distribution as test points
No strong class imbalance	Voting is majority-based — rare classes are easily overwhelmed

The most frequently violated assumption in practice is feature relevance — including even a handful of noisy, irrelevant features can completely destroy a kNN model's performance by polluting the distance computation.

11. Evaluation Metrics

kNN is evaluated with standard Classification and regression metrics. A few nuances apply:

Classification

Accuracy     = (TP + TN) / Total
[[Precision]]    = TP / (TP + FP)
Recall       = TP / (TP + FN)
[[F1-Score]]     = 2 · [[Precision]] · Recall / ([[Precision]] + Recall)
ROC-AUC      (use predict_proba — class fractions from k neighbors)

Probability calibration note: kNN probabilities (P̂ = count_c / k) are poorly calibrated, especially for small k. With k=3, probabilities can only be 0, 0.33, 0.67, or 1.0 — a very coarse grid. Larger k gives finer probability resolution but introduces more smoothing.

Regression

MSE   = (1/m) · Σ(yᵢ − ŷᵢ)²
RMSE  = √MSE
MAE   = (1/m) · Σ|yᵢ − ŷᵢ|
R²    = 1 − SS_res / SS_tot

kNN-Specific Diagnostics

Diagnostic	What it tells you
Cross-validated k sweep	Plot val Accuracy vs. k to find the optimal neighborhood
Distance distribution plot	Are neighbors truly close? Or far due to curse of dim?
Leave-one-out error (LOO)	For small datasets: exact estimate of 1-NN generalization
Neighbor label distribution	How mixed are the k nearest neighbors? (purity check)

12. Advantages

✅ Conceptual Simplicity

kNN is arguably the most intuitive algorithm in ML. There is nothing to explain except: "find the most similar examples and copy their label." No math required to understand the prediction.

✅ No Training Phase

kNN requires zero optimization time. Training is instantaneous — just store the data. This makes it ideal for:

Streaming scenarios where new data arrives continuously
Rapid prototyping where you need a model immediately
Online learning — adding a new point costs nothing

✅ Non-Parametric — No Assumptions on Data Distribution

kNN makes no assumptions about the functional form of the decision boundary. It adapts to any shape — spirals, concentric circles, irregular blobs — as long as the continuity assumption holds and data is dense enough.

✅ Naturally Handles Multi-Class

No special extension needed. Voting works identically for K classes — just count the most frequent class among the k neighbors.

✅ Adapts Instantly to New Data

Adding new training examples requires no retraining — just insert into the data store (and update the search structure). This is the foundation of case-based reasoning systems and recommendation engines.

✅ Powerful with the Right Distance Metric

With a well-designed distance (or learned metric), kNN can solve remarkably complex problems. The algorithm's quality is entirely determined by the quality of the metric.

✅ Theoretical Guarantees

For large datasets, kNN converges to the Bayes optimal classifier. The 1-NN error rate is bounded above by twice the Bayes error — a hard theoretical guarantee rare in ML.

✅ Works Well on Small Datasets

With very few training samples, complex models overfit easily. kNN's non-parametric nature avoids this — it doesn't impose any structure that could be wrong.

13. Drawbacks & Limitations

❌ Slow Prediction — O(m·p) Per Query

Every prediction requires comparing the query point against all m training points. With m = 1,000,000 and p = 100, that's 100 million operations per query. This is completely unacceptable in production without approximate search structures (Section 7.4).

❌ Memory Scales with Data

The entire training dataset must remain in memory at all times. You cannot compress it into a small set of parameters (unlike SVM's support vectors or a neural network's weights). For m = 10 million high-dimensional points, this means gigabytes of RAM.

❌ Devastated by Irrelevant Features

Every irrelevant feature dilutes the distance computation. With 2 relevant features and 98 noisy features, the 98 noisy dimensions completely dominate distance calculations. The "nearest neighbors" become effectively random.

❌ Curse of Dimensionality

As described in Section 9 — in high dimensions, distances become meaningless, and exponentially more data is needed to maintain local density.

❌ Sensitive to Feature Scale

A feature with range [0, 10,000] overwhelms features with range [0, 1] in Euclidean distance. Always scale before kNN. Forgetting this is the most common practical mistake.

❌ Poor Probability Calibration for Small k

With k=3, only four distinct probabilities exist: {0, 1/3, 2/3, 1}. For tasks requiring well-calibrated probabilities, kNN is a poor choice.

❌ Class Imbalance Vulnerability

Majority voting is naturally biased toward the dominant class. In a 9:1 imbalanced dataset, a query in an ambiguous region will almost always be classified as the majority class.

❌ Optimal k Requires Cross-Validation

There is no formula to compute the optimal k analytically. It must be found by cross-validation — an O(m²) operation for leave-one-out CV.

❌ No Interpretability

kNN produces no coefficients, no rules, no feature importances. The "reason" for a prediction is "these k training points are nearby" — which is rarely satisfying for regulatory, medical, or business contexts.

❌ Struggles Near Decision Boundaries

In ambiguous regions where classes overlap, kNN is the most uncertain — but its k-fraction probability is too coarse to express that uncertainty well.

14. kNN vs. Other Classifiers

Classifier	Interpretable	Non-linear	No Scaling	High-dim	Large Data	Probabilistic
kNN	❌	✅	❌	❌	❌	⚠️
Logistic Regression	✅	❌	❌	✅	✅	✅
SVM (kernel)	❌	✅	❌	✅	❌	❌
Decision Tree	✅	✅	✅	⚠️	✅	⚠️
Random Forest	⚠️	✅	✅	⚠️	✅	✅
Naive Bayes	✅	❌	❌	✅	✅	✅
Neural Network	❌	✅	❌	✅	✅	✅
Gradient Boosting	⚠️	✅	✅	⚠️	✅	✅

15. Practical Tips & Gotchas

Always Scale Features First

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
pipeline.fit(X_train, y_train)

This is the single most important step. Without scaling, a feature like salary [20k–200k] completely overpowers a feature like age [20–70]. Always use a Pipeline to prevent data leakage from the scaler.

Find Optimal k via Cross-Validation

import numpy as np
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

k_range = range(1, 51)
k_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='Accuracy')
    k_scores.append(scores.mean())

optimal_k = k_range[np.argmax(k_scores)]
print(f"Optimal k: {optimal_k}")

plt.plot(k_range, k_scores)
plt.xlabel('k'); plt.ylabel('Cross-Validated [[Accuracy]]')
plt.title('Elbow curve for k selection')

Look for the elbow — where increasing k stops improving (or hurts) Classification.

Choose the Right Algorithm for Your Dataset Size

KNeighborsClassifier(algorithm='auto')     # Let sklearn decide
KNeighborsClassifier(algorithm='brute')    # Best for p > 20, small m
KNeighborsClassifier(algorithm='kd_tree')  # Best for p < 20, large m
KNeighborsClassifier(algorithm='ball_tree')# Best for p = 20–50, structured data

Leaf size also matters for tree-based algorithms:

KNeighborsClassifier(algorithm='kd_tree', leaf_size=30)  # Default: 30
# Smaller leaf_size → larger tree → faster queries but more memory
# Larger leaf_size → smaller tree → slower queries but less memory

Handle Class Imbalance

kNN's majority vote is biased toward the dominant class. Solutions:

# Option 1: Distance weighting (nearby minority points get more weight)
KNeighborsClassifier(weights='distance')

# Option 2: Oversample the minority class (SMOTE creates synthetic kNN-based examples)
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X_train, y_train)

# Option 3: Reduce k — smaller neighborhoods are less dominated by majority

Reduce Dimensionality Before kNN

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),      # Reduce to 10 dims
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

Or use learned metric embeddings for maximum quality:

from sklearn.neighbors import NeighborhoodComponentsAnalysis
nca = NeighborhoodComponentsAnalysis(n_components=10, random_state=42)
X_reduced = nca.fit_transform(X_train, y_train)

NCA (Neighborhood Components Analysis) learns a linear transformation that maximizes kNN [[Accuracy]] directly — state of the art for kNN preprocessing.

Use ANN for Large Datasets

For m > 100,000, switch to approximate nearest neighbor libraries:

import faiss
import numpy as np

# Build index
d = X_train.shape[1]                     # Dimensionality
index = faiss.IndexFlatL2(d)             # L2 distance, exact
index.add(X_train.astype('float32'))

# Query
k = 5
distances, indices = index.search(X_test.astype('float32'), k)
# Then majority-vote indices for predictions

FAISS can handle billion-point datasets with millisecond query times.

Radius-Based Neighbors (Alternative to k)

Instead of fixing k, fix a radius r and use all points within that radius:

from sklearn.neighbors import RadiusNeighborsClassifier

rnc = RadiusNeighborsClassifier(radius=1.0, weights='distance')
rnc.fit(X_train, y_train)

Advantage: Automatically adapts the number of neighbors to local density — dense regions use many neighbors (smooth), sparse regions use few (local).

Disadvantage: In sparse regions, zero neighbors may fall within r — needs an outlier class or fallback strategy.

Diagnose With a Distance Distribution Plot

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_train)
distances, _ = nn.kneighbors(X_test)

plt.hist(distances[:, -1], bins=50)  # Distance to 5th nearest neighbor
plt.xlabel('Distance to 5th neighbor')
plt.title('Neighbor distance distribution')

If this distribution is very broad (distances range from 0.1 to 100), your features are on different scales or many are irrelevant — scale and reduce before kNN.

15. When to Use It

Use kNN when:

Dataset is small to medium (m < ~100,000 without ANN structures)
Low to moderate dimensionality (p < ~20, or after dimensionality reduction)
No time to train — you need predictions immediately from new data
Data arrives as a stream — online learning is trivial
Decision boundary is highly irregular and no parametric form fits
Multi-class Classification with many classes (voting scales naturally)
Recommendation systems — user-item similarity is the natural metric
Anomaly detection — outliers have few, distant neighbors
You want a fast, non-parametric baseline with zero assumptions
Domain has a natural distance (geospatial, genomic, image features)

Do NOT use kNN when:

Dimensionality is high (p > 20) without aggressive feature selection or reduction
Dataset is large (m > 100,000) without ANN infrastructure
Fast prediction is required in a latency-sensitive production system
Interpretability is required — kNN gives no explanations
Features are categorical and numerous — distance on mixed types is tricky
Calibrated probabilities are needed for risk estimation
Class imbalance is severe — majority voting will fail the minority class

Summary

┌────────────────────────────────────────────────────────────────────┐
│                    kNN AT A GLANCE                                 │
├────────────────────────────────────────────────────────────────────┤
│  CORE IDEA      Find k most similar training points → vote         │
│  TRAINING       O(1) — store the data, nothing else               │
│  PREDICTION     O(m·p) brute force, O(log m) with KD/Ball tree    │
│  DECISION       Majority class (or weighted vote) of k neighbors   │
│  GEOMETRY       Voronoi diagram (k=1) → smooth blending (k>1)     │
│  STRENGTH       Non-parametric, online learning, multiclass        │
│  WEAKNESS       Slow at scale, curse of dimensionality, no interp │
│  KEY HYPERP.    k, distance metric, weighting scheme              │
│  MUST DO        Always scale features. Always.                     │
│  BEST FOR       Low-dim dense data, streaming, anomaly detection   │
└────────────────────────────────────────────────────────────────────┘

kNN is the purest expression of a philosophical position in machine learning: don't generalize, just remember. No abstraction, no compression, no model — just the raw data, sitting in memory, waiting to answer by analogy. This makes it uniquely fragile in high dimensions and uniquely powerful in dense, low-dimensional spaces. It is the algorithm that powers recommendation systems, vector search, and anomaly detection at scale — not as a toy, but as infrastructure. Master kNN and you master the concept of similarity itself, which turns out to be the foundation of far more of machine learning than it first appears.