Support Vector Machines (SVM)

Pasted image 20260319232345.png

Support Vector Machines — Deep Analysis

"Find the boundary that is most wrong about nothing."

1. What Is a Support Vector Machine?

A Support Vector Machine (SVM) is a supervised machine learning algorithm primarily used for Classification, though it extends naturally to regression (SVR) and anomaly detection. It was developed by Vladimir Vapnik and colleagues at Bell Labs in the 1990s, rooted in the theoretical framework of Structural Risk Minimization and VC dimension from statistical learning theory.

The central idea: rather than just finding any boundary that separates classes, SVM finds the boundary that maximizes the margin — the gap between the two classes. This makes it the most geometrically principled classifier in classical machine learning.

2. The Core Geometric Intuition

Imagine two classes of points plotted on a 2D plane — red circles and blue squares. Many lines could separate them. Which one should you choose?

  Blue  |      | Red
   ●    |      |    ■
  ●     |      |   ■
    ●   |  ←M→ |  ■
   ●    |      |    ■

SVM's answer: Choose the line that is furthest from both classes simultaneously — the one that maximizes the margin M. This is not just aesthetically pleasing; it has a deep justification:

A wider margin means the classifier generalizes better to unseen data
Points near the boundary are the most dangerous — a wide margin gives them more room
Maximizing margin is equivalent to minimizing generalization error (Vapnik's bound)

The points that sit exactly on the margin boundary and define it are called support vectors — the algorithm's namesake. Everything else in the dataset is irrelevant; remove any non-support-vector point and the solution doesn't change.

3. Mathematical Foundation

3.1 The Hyperplane

A hyperplane in n-dimensional space is defined by:

wᵀx + b = 0

Where:

w is the weight vector (normal to the hyperplane)
b is the bias (shifts the hyperplane from the origin)
x is a data point

For a 2D problem, this reduces to the familiar w₁x₁ + w₂x₂ + b = 0 — a line.

The signed distance of a point x from the hyperplane is:

distance = (wᵀx + b) / ||w||

This sign tells us which side of the hyperplane the point lies on — critical for Classification.

3.2 The Margin

We define two margin hyperplanes (gutters), one on each side of the decision boundary:

wᵀx + b = +1   (positive margin plane)
wᵀx + b = −1   (negative margin plane)

The geometric distance between these two planes is:

Margin M = 2 / ||w||

Maximizing M is equivalent to minimizing ||w|| (or ||w||² for mathematical convenience). This is the core of the optimization problem.

3.3 Support Vectors

Support vectors are the training points that satisfy:

wᵀxᵢ + b = +1   (closest positive-class points)
wᵀxᵢ + b = −1   (closest negative-class points)

These are the only points that matter for defining the decision boundary. In practice, typically only 5–30% of training data are support vectors.

This sparsity is a crucial property: SVM is a sparse model, defined entirely by a small subset of the training data.

3.4 The Optimization Problem (Hard Margin)

For perfectly linearly separable data, the optimization problem is:

Minimize:   (1/2) ||w||²

Subject to: yᵢ(wᵀxᵢ + b) ≥ 1   for all i = 1, ..., m

Where yᵢ ∈ {−1, +1} are the class labels (note: SVM uses ±1, not 0/1).

The constraint yᵢ(wᵀxᵢ + b) ≥ 1 means:

For positive-class points (y=+1): wᵀx + b ≥ +1 ✅
For negative-class points (y=−1): wᵀx + b ≤ −1 ✅

This is a convex quadratic program — it has a unique global minimum and can be solved efficiently.

4. Soft Margin SVM

Real-world data is almost never perfectly linearly separable. Hard margin SVM fails when:

Classes overlap (some misClassification is inevitable)
Data has outliers that would completely distort the boundary
A rigid boundary would have near-zero margin to achieve perfect separation

4.1 Slack Variables

We introduce slack variables ξᵢ ≥ 0 for each training point, allowing limited violations of the margin:

yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ

Interpretation of ξᵢ:

ξᵢ value	Meaning
ξᵢ = 0	Point is correctly classified, outside or on margin
0 < ξᵢ ≤ 1	Point is inside the margin but correctly classified
ξᵢ > 1	Point is misclassified (on the wrong side)

4.2 The C Hyperparameter

The modified optimization problem becomes:

Minimize:   (1/2)||w||² + C · Σ ξᵢ

Subject to: yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ,   ξᵢ ≥ 0

C controls the bias-variance tradeoff:

C value	Behavior	Risk
C → ∞	No violations tolerated → hard margin	Overfitting
C large	Few violations → narrow margin, complex boundary	Overfitting
C small	Many violations allowed → wide margin, simpler fit	Underfitting
C → 0	Ignores all constraints → trivial solution	Underfitting

Intuition: C asks "How much do I care about misclassifying training points?" High C = care a lot. Low C = prioritize a wide margin even if it costs some misClassifications.

Note: In sklearn, the hinge loss formulation uses C inversely to regularization — C = 1/λ.

5. The Kernel Trick

This is where SVM goes from a clever linear classifier to one of the most powerful non-linear classifiers in classical ML.

5.1 Why Kernels?

When data is not linearly separable in its original space, the intuition is:

Map the data into a higher-dimensional space where it is linearly separable.

Original 2D space (not separable)  →  Map to 3D  →  Now separable with a plane

A classic example: XOR data. Points at (0,0), (1,1) are Class A; (1,0), (0,1) are Class B. No line separates them in 2D. But add the feature x₃ = x₁·x₂ → the classes become linearly separable in 3D.

The problem: Mapping to very high (or infinite) dimensional spaces and computing dot products there is computationally catastrophic.

5.2 How the Kernel Trick Works

The SVM optimization (in its dual form — see Section 6) only ever uses data points through dot products xᵢᵀxⱼ.

The kernel trick replaces this with a kernel function:

K(xᵢ, xⱼ) = φ(xᵢ)ᵀ · φ(xⱼ)

Where φ(x) is the (potentially infinite-dimensional) feature mapping.

The magic: we never compute φ(x) explicitly. We only compute K(xᵢ, xⱼ) in the original input space, which is fast and efficient — but mathematically equivalent to operating in the transformed space.

Computation in original space:  K(xᵢ, xⱼ)   ← cheap
Implicit representation:         φ(xᵢ)ᵀφ(xⱼ) ← infinitely expensive if computed directly

A function K is a valid kernel if and only if it satisfies Mercer's condition: the kernel matrix K (where Kᵢⱼ = K(xᵢ, xⱼ)) must be positive semi-definite. This guarantees the kernel corresponds to some dot product in some feature space.

5.3 Common Kernel Functions

Linear Kernel

K(xᵢ, xⱼ) = xᵢᵀxⱼ

No transformation. Equivalent to standard SVM. Use when data is already linearly separable.

Polynomial Kernel

K(xᵢ, xⱼ) = (γ · xᵢᵀxⱼ + r)^d

d = degree of polynomial
r = constant term (coef0 in sklearn)
Captures polynomial interactions between features
d=2 adds quadratic features; d=3 adds cubic; etc.

Radial Basis Function (RBF) / Gaussian Kernel

K(xᵢ, xⱼ) = exp(−γ · ||xᵢ − xⱼ||²)

The most widely used kernel
Implicitly maps to infinite-dimensional space
Measures similarity as a Gaussian decay of Euclidean distance
γ controls the "reach" of each support vector

γ value	Effect
γ large	Narrow Gaussian → complex, wiggly boundary
γ small	Wide Gaussian → smooth, simple boundary

Intuition: Each support vector "broadcasts" its influence over a Gaussian-shaped region. High γ = narrow influence = the model memorizes local structure.

Sigmoid Kernel

K(xᵢ, xⱼ) = tanh(γ · xᵢᵀxⱼ + r)

Resembles a neural network with one hidden layer. Not always positive semi-definite — use with caution.

5.4 Choosing a Kernel

Start → Is data high-dimensional and sparse?   → YES → Linear kernel
         ↓ NO
         Do you have domain knowledge of feature interactions? → YES → Polynomial
         ↓ NO
         Default: RBF kernel (grid-search γ and C)
         ↓
         Only try Sigmoid if motivated by neural network analogy

Rule of thumb: Always try Linear first (it's fastest). If performance is insufficient, try RBF with cross-validated hyperparameters.

6. How SVM Is Trained — The Dual Problem

6.1 Lagrangian Duality

SVM's training is formulated as a constrained optimization problem. The standard approach uses Lagrange multipliers to convert it into an unconstrained form called the Lagrangian:

L(w, b, α) = (1/2)||w||² − Σᵢ αᵢ[yᵢ(wᵀxᵢ + b) − 1]

Where αᵢ ≥ 0 are Lagrange multipliers — one per training example.

Setting partial derivatives to zero yields:

w = Σᵢ αᵢ yᵢ xᵢ        (weight vector as linear combo of training points)
Σᵢ αᵢ yᵢ = 0            (bias constraint)

Substituting back gives the dual problem:

Maximize:   Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢᵀxⱼ)

Subject to: αᵢ ≥ 0,   Σᵢ αᵢ yᵢ = 0

Key insight: the dual depends on data only through dot products xᵢᵀxⱼ — exactly where kernels plug in.

6.2 KKT Conditions

The Karush-Kuhn-Tucker (KKT) conditions are the necessary and sufficient conditions for optimality:

αᵢ = 0        →  Point is NOT a support vector (correctly classified, outside margin)
0 < αᵢ < C   →  Point IS a support vector (exactly on the margin boundary)
αᵢ = C        →  Point violates the margin (inside margin or misclassified)

This gives SVM its sparsity: only points with αᵢ > 0 (support vectors) contribute to w. All other training points can be discarded post-training.

Prediction for a new point x:

f(x) = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b)

This is computed using only the support vectors — elegant and efficient at inference time.

6.3 Solvers

Solver	Algorithm	Best For
SMO	Sequential Minimal Optimization	General purpose (used in libSVM)
liblinear	Coordinate descent	Large-scale linear SVM
SDCA	Stochastic Dual Coordinate Ascent	Large datasets, online learning
Cutting-plane	Structural SVM	Structured output problems

SMO (Platt, 1998) is the workhorse. It decomposes the QP into the smallest possible subproblems (pairs of αᵢ), each with a closed-form solution, iterating until convergence.

Time complexity: O(m²) to O(m³) for kernel SVM — this is the main scalability bottleneck.

7. SVM for Multiclass Classification

SVM is inherently a binary classifier. Two strategies extend it to K classes:

One-vs-One (OvO)

Train K(K−1)/2 binary classifiers — one for each pair of classes. Predict by majority vote.

For K=4: train classifiers for (1v2), (1v3), (1v4), (2v3), (2v4), (3v4)
→ 6 classifiers, each votes → take the majority

Advantage: Each classifier is trained on a small, balanced subset of data.
Disadvantage: Number of classifiers grows quadratically with K.

One-vs-Rest (OvR)

Train K binary classifiers: each class vs. all others. Pick the class with the highest decision score.

Advantage: Fewer classifiers (K vs K²/2).
Disadvantage: Training sets become imbalanced (1 class vs. K-1 classes).

sklearn default: OvO for SVC, OvR for LinearSVC.

8. Assumptions of the Model

Assumption	Description
Separability (with margin)	Classes should be separable, at least with soft margin tolerance
Feature relevance	SVM assumes all features contribute — irrelevant features degrade the margin
No native probability calibration	Raw SVM scores are not probabilities; Platt scaling must be applied separately
Scale Sensitivity	Features must be on the same scale; SVM is not scale-invariant
Feature independence not required	Unlike Naive Bayes, SVM handles correlated features naturally
Appropriate kernel choice	The kernel must match the true structure of the data

9. Evaluation Metrics

Since SVM is a classifier, standard Classificationon metrics]] apply. Note one critical difference from Logistic Regression:

SVM Has No Native Probabilities

Raw SVM produces a decision score f(x) = wᵀx + b, not a probability. To get probabilities:

# Platt scaling — fits a sigmoid to the SVM scores
model = SVC(probability=True)  # Adds cross-validation overhead

Platt scaling trains a Logistic Regression on top of SVM's decision scores. It works but adds computational cost and can be poorly calibrated.

Decision Function vs. Threshold

Predict +1  if  f(x) > 0
Predict −1  if  f(x) < 0

You can shift the threshold by adjusting b (the bias). This is useful in imbalanced settings.

Standard Metrics

Accuracy    = (TP + TN) / Total
[[Precision]]   = TP / (TP + FP)
Recall      = TP / (TP + FN)
[[F1-Score]]    = 2 · [[Precision]] · Recall / ([[Precision]] + Recall)

Hinge Loss

The native loss function of SVM, useful for monitoring training:

L_hinge = max(0, 1 − yᵢ · f(xᵢ))

If yᵢ · f(xᵢ) ≥ 1: point is correctly classified with margin → zero loss
If yᵢ · f(xᵢ) < 1: point violates margin → positive loss

This is a convex upper bound on 0/1 loss. It is differentiable everywhere except at the "hinge" point.

10. Advantages

✅ Effective in High-Dimensional Spaces

SVM works extremely well when the number of features exceeds the number of samples (p >> n). Text Classification (tf-idf feature spaces) is a canonical example where SVM has historically dominated.

✅ Memory Efficient (Sparse Model)

The model is defined only by the support vectors — often a small fraction of training data. At prediction time, only support vectors participate in computation.

✅ Versatile via Kernels

By swapping kernels, SVM can model radically different decision boundaries — from linear to polynomial curves to Gaussian blobs — using the same underlying framework.

✅ Global Optimum Guaranteed

The objective is strictly convex (quadratic program). There are no local minima — gradient-based solvers always find the unique global solution.

✅ Strong Theoretical Guarantees

SVM minimizes the VC-dimension bound on generalization error. The margin is a direct measure of the model's generalization capacity — wider margin = lower VC dimension = better expected test performance.

✅ Robust to Overfitting in High Dimensions

Unlike many other models, SVM's complexity is controlled by the number of support vectors (not the number of features). It can generalize well even with many features.

✅ Works Well with Clear Margin Separation

When a clear gap exists between classes, SVM finds the optimal boundary with remarkable reliability.

✅ Kernel Flexibility

The kernel abstraction lets SVM operate on non-vectorial data — strings, graphs, sets — as long as a valid kernel can be defined.

11. Drawbacks & Limitations

❌ Scalability — The Achilles Heel

Training complexity is O(m²) to O(m³) in the number of training samples. For m = 100,000 points, this is computationally prohibitive with standard kernel SVM. Linear SVM (liblinear) scales to millions, but kernel SVM struggles beyond tens of thousands.

❌ No Native Probabilistic Output

SVM does not naturally output calibrated probabilities. Platt scaling helps but adds overhead and can be unreliable, especially with small datasets.

❌ Kernel and Hyperparameter Sensitivity

Choosing the right kernel and tuning C, γ, d, etc. requires expensive cross-validation. A poor kernel choice can result in very poor performance.

❌ Difficult to Interpret

Unlike Logistic Regression where coefficients have clear meaning, SVM coefficients in the dual space are not directly interpretable. With non-linear kernels, the model is essentially a black box.

❌ Sensitive to Feature Scaling

SVM with RBF kernel computes Euclidean distances. Features with large ranges dominate. Always scale before training.

❌ Sensitive to Noisy Data

SVM tries to maximize margin, which makes it sensitive to mislabeled points or overlapping distributions. Every noisy point that becomes a support vector directly distorts the boundary.

❌ Class Imbalance Handling is Manual

SVM does not naturally handle imbalanced classes. You must use class_weight='balanced' or resampling techniques.

❌ No Online Learning (for Kernel SVM)

Standard SVM requires the entire training set to be in memory. Incremental/online updates are non-trivial (though online SVM variants exist).

12. SVM vs. Other Classifiers

Classifier	Non-linear	Probabilistic	Interpretable	Large Data	Scalable
SVM (kernel)	✅	❌	❌	❌	❌
SVM (linear)	❌	❌	⚠️	⚠️	✅
Logistic Regression	❌	✅	✅	✅	✅
Random Forest	✅	✅	❌	✅	✅
Neural Network	✅	✅	❌	✅	✅
k-NN	✅	⚠️	⚠️	❌	❌
Naive Bayes	❌	✅	✅	✅	✅
Decision Tree	✅	⚠️	✅	✅	✅

13. Practical Tips & Gotchas

Always Scale Your Features

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

With RBF kernel, unscaled features make the kernel meaningless. Feature with range [0, 1000] will dominate feature with range [0, 1].

Hyperparameter Tuning — The Core Workflow

For RBF kernel, always tune both C and γ together:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C':     [0.01, 0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
    'kernel': ['rbf', 'poly', 'linear']
}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='f1', n_jobs=-1)
grid.fit(X_train, y_train)

[[Start]] coarse, then fine-tune. A log-scale grid (powers of 10) is standard.

Getting Probabilities

model = SVC(probability=True)   # Enables Platt scaling
probs = model.predict_proba(X_test)

⚠️ This adds significant training time due to cross-validation for Platt scaling. Use decision_function() instead when probabilities are not needed.

Handling Class Imbalance

model = SVC(class_weight='balanced')
# or custom weights:
model = SVC(class_weight={0: 1, 1: 5})

When Your Dataset Is Large

For m > 10,000 samples, avoid kernel SVM. Use:

from sklearn.svm import LinearSVC          # Scales to millions
from sklearn.linear_model import SGDClassifier(loss='hinge')  # Online learning

Or use approximate kernels for a compromise:

from sklearn.kernel_approximation import RBFSampler
rbf_feature = RBFSampler(gamma=0.1, n_components=300)
X_approx = rbf_feature.fit_transform(X_train)
# Then train LinearSVC on X_approx

This approximates the RBF kernel with explicit (random) features — O(m) training time.

Diagnosing With the Decision Function

scores = model.decision_function(X_test)
# Positive scores → Class +1, Negative → Class −1
# |score| indicates confidence — magnitude, not probability

Plot the distribution of scores for each class to understand the margin and overlap.

Number of Support Vectors as a Health Check

print(model.support_vectors_.shape)
print(model.n_support_)

Very few SVs (< 5% of data): Model may be too simple (low C or RBF too broad)
Many SVs (> 50% of data): Model may be overfitting or data has poor separability

14. When to Use It

Use SVM when:

You have a small-to-medium dataset (up to ~50,000 samples)
Dimensionality is high and samples are relatively few (text Classification, bioinformatics)
The margin between classes is clear (even if not perfectly separable)
You need non-linear Classification and have time for kernel tuning
You want a theoretically principled model with strong generalization bounds
Working with non-standard data types (graphs, strings) with a custom kernel
[[Accuracy]] matters more than interpretability

Do NOT use SVM when:

Your dataset has more than ~50,000 samples (unless using LinearSVC or SGD)
You need calibrated probabilities directly from the model
You need fast retraining on streaming or changing data
Interpretability is required by stakeholders or regulations
You have many irrelevant or noisy features without prior selection
You're working with images, audio, or text at scale — use deep learning

Summary

┌───────────────────────────────────────────────────────────────┐
│                  SVM AT A GLANCE                              │
├───────────────────────────────────────────────────────────────┤
│  CORE IDEA      Max-margin hyperplane between classes         │
│  KEY MATH       Convex QP via Lagrangian duality              │
│  TRAINING       SMO solver on dual problem                    │
│  DECISION       sign(Σ αᵢ yᵢ K(xᵢ, x) + b)                  │
│  POWER MOVE     Kernel trick → non-linear without mapping φ   │
│  STRENGTH       High-dim, small data, global optimum          │
│  WEAKNESS       Slow on large data, no probabilities          │
│  HYPERPARAMS    C (margin tolerance) + kernel params          │
│  BEST FOR       Text, bioinformatics, clean [[Classification]]     │
└───────────────────────────────────────────────────────────────┘