FIG. 13 · ATLAS · Support Vector Machine
The widest margin wins.
Find the hyperplane that separates classes with the largest possible margin. The points sitting on that margin, the support vectors, are the only points the model needs; everything else is throwaway.
Press Run to watch the linear soft-margin SVM converge via gradient descent on hinge loss. The two dashed lines are the margin; points on or inside it are circled as support vectors. The C slider controls how much we tolerate margin violations: a high C means a hard boundary, a low C means a smoother boundary with more violations.
§ I · The margin, drawn live
Solid red is the decision boundary. Dashed lines on either side are the margin. Larger circles are support vectors.
§ II · How it works
The SVM finds the hyperplane that maximizes the perpendicular distance to the nearest points of either class. Those nearest points are the support vectors — the model can be reconstructed from them alone. Discard every other training point and the boundary is unchanged.
Soft-margin SVM allows some points to violate the margin in exchange for a smoother boundary, controlled by a regularization parameter (here labeled C). High C means low tolerance for violations — effectively a hard margin if the data is separable. Low C lets violations through to keep the boundary simple.
Kernel SVM lifts the linear method to non-linear boundaries by mapping the data into a higher-dimensional space implicitly. The RBF kernel is the canonical choice. This demo is linear-only; the circles preset shows what linear SVM gets wrong.
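The demo is linear-only, but a short sketch (illustrative, using scikit-learn rather than the demo's own code) makes the circles preset concrete: a linear kernel stalls near chance on concentric rings, while an RBF kernel separates them cleanly.

```python
# Illustrative sketch (not the demo's code): linear vs RBF kernel on a
# "circles" shape, using scikit-learn.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.08, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear accuracy:", linear.score(X_test, y_test))  # near chance (~0.5)
print("rbf accuracy:   ", rbf.score(X_test, y_test))     # near 1.0
```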
The math
The primal soft-margin objective:
min_{w, b}  ½‖w‖² + C Σ_i max(0, 1 − y_i(w·x_i + b))

The first term shrinks the weights (regularization). The second is the hinge loss: zero when a point is correctly classified beyond the margin, growing linearly the deeper it sits inside it.
This demo runs sub-gradient descent on the primal directly. Production SVMs use SMO or quadratic-programming solvers for tighter convergence and kernels via the dual formulation.
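For readers who want the loop spelled out, here is a minimal NumPy sketch of primal sub-gradient descent on that objective. It assumes labels in {−1, +1} and full-batch updates; it is a stand-in for, not a copy of, the demo's source.

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on  ½‖w‖² + C Σ max(0, 1 − y(w·x + b)).
    X: (n, d) features;  y: labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violators = margins < 1                 # points on or inside the margin
        # Sub-gradient: the regularizer pulls w toward 0; the hinge term
        # pushes only on margin violators.
        grad_w = w - C * (y[violators, None] * X[violators]).sum(axis=0)
        grad_b = -C * y[violators].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# After training, support vectors are simply the points whose margin is ≤ 1:
# sv = np.where(y * (X @ w + b) <= 1 + 1e-6)[0]
```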
§ III · Where it shines, where it breaks
Small high-dimensional datasets
The number of parameters is roughly the number of support vectors, which is often a small fraction of the training set. SVMs are famously strong on text classification, where the feature space has tens of thousands of dimensions but the training set only tens of thousands of examples.
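As a hedged illustration of that regime (the dataset and category choices here are mine, not the article's), the classic recipe is sparse tf-idf features feeding a linear SVM:

```python
# Illustrative sketch: linear SVM on sparse, high-dimensional text features.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))
```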
Sparse, expensive labels
If labels are costly (medical, scientific) and the dataset stays under a few thousand points, SVM's margin objective often outperforms tree ensembles, which need more data.
Big data
Training cost scales between O(n²) and O(n³). At a million points the wait is hours; at ten million it's days. Linear SVM with specialized solvers (LIBLINEAR) or stochastic gradient descent extends the range, but kernel SVM hits a wall.
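A sketch of the stochastic-gradient escape hatch, assuming scikit-learn's SGDClassifier as the solver (the chunked loop below is a synthetic stand-in for streaming from disk): hinge loss plus an L2 penalty is a linear SVM trained one chunk at a time, so the data never has to fit in memory.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Linear SVM trained by SGD: hinge loss + L2 penalty, updated chunk by chunk.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4)

rng = np.random.default_rng(0)
for step in range(100):                       # stand-in for streaming chunks
    X_chunk = rng.normal(size=(1_000, 50))
    y_chunk = (X_chunk[:, 0] + X_chunk[:, 1] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=[0, 1])
```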
Multi-class natively
SVM is a binary classifier. Multi-class is faked with one-vs-rest or one-vs-one schemes that scale poorly past a few dozen classes. For problems with many classes, neural networks or tree ensembles are simpler.
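For scale, a sketch of the one-vs-rest workaround (the digits dataset is an arbitrary stand-in): k classes means k underlying binary SVMs to train and evaluate.

```python
# One-vs-rest: one binary SVM per class, so cost grows linearly with classes.
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)           # 10 classes
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10_000)).fit(X, y)
print(len(ovr.estimators_))                   # 10 fitted binary SVMs
```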
§ IV · Trade-off scorecard
- Inference: 0.65
- Accuracy: 0.80
- Training: 0.55
- Small size: 0.70
§ V · In production
Spam filtering at Gmail (early generations). Linear SVM on bag-of-words features was the workhorse of email spam classification for the better part of a decade — tens of thousands of features, billions of examples streamed through LibLinear, daily retrains. Modern Gmail uses neural networks; the SVM era is what made the modern era possible.
§ VI · Compare to
Logistic Regression
Same shape, different loss
KNN
Local, not global
Neural Network (MLP)
Non-linear · phase 3