FIG. 13 · ATLAS · Support Vector Machine
The widest margin wins.
Find the hyperplane that separates classes with the largest possible margin. The points sitting on that margin, the support vectors, are the only points the model needs; everything else is throwaway.
Press Run to watch the linear soft-margin SVM converge via gradient descent on hinge loss. The two dashed lines are the margin; points on or inside it are circled as support vectors. The C slider controls how much we tolerate margin violations: a high C means a hard boundary, a low C means a smoother boundary with more violations.
§ I · The margin, drawn live
Solid red is the decision boundary. Dashed lines on either side are the margin. Larger circles are support vectors.
§ II · How it works
The SVM finds the hyperplane that maximizes the perpendicular distance to the nearest points of either class. Those nearest points are the support vectors — the model can be reconstructed from them alone. Discard every other training point and the boundary is unchanged.
Soft-margin SVM allows some points to violate the margin in exchange for a smoother boundary, controlled by a regularization parameter (here labeled C). High C means low tolerance for violations — effectively a hard margin if the data is separable. Low C lets violations through to keep the boundary simple.
Kernel SVM lifts the linear method to non-linear boundaries by mapping the data into a higher-dimensional space implicitly. The RBF kernel is the canonical choice. This demo is linear-only; the circles preset shows what linear SVM gets wrong.
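The demo is linear-only, but a short sketch (illustrative, using scikit-learn rather than the demo's own code) makes the circles preset concrete: a linear kernel stalls near chance on concentric rings, while an RBF kernel separates them cleanly.

```python
# Illustrative sketch (not the demo's code): linear vs RBF kernel on a
# "circles" shape, using scikit-learn.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.08, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear accuracy:", linear.score(X_test, y_test))  # near chance (~0.5)
print("rbf accuracy:   ", rbf.score(X_test, y_test))     # near 1.0
```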
The math
The primal soft-margin objective:
min_{w, b}  ½‖w‖² + C Σ_i max(0, 1 − y_i(w·x_i + b))

The first term shrinks the weights (regularization). The second is the hinge loss: zero when a point is correctly classified beyond the margin, growing linearly the deeper it sits inside it.
This demo runs sub-gradient descent on the primal directly. Production SVMs use SMO or quadratic-programming solvers for tighter convergence and kernels via the dual formulation.
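For readers who want the loop spelled out, here is a minimal NumPy sketch of primal sub-gradient descent on that objective. It assumes labels in {−1, +1} and full-batch updates; it is a stand-in for, not a copy of, the demo's source.

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on  ½‖w‖² + C Σ max(0, 1 − y(w·x + b)).
    X: (n, d) features;  y: labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violators = margins < 1                 # points on or inside the margin
        # Sub-gradient: the regularizer pulls w toward 0; the hinge term
        # pushes only on margin violators.
        grad_w = w - C * (y[violators, None] * X[violators]).sum(axis=0)
        grad_b = -C * y[violators].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# After training, support vectors are simply the points whose margin is ≤ 1:
# sv = np.where(y * (X @ w + b) <= 1 + 1e-6)[0]
```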
§ III · Where it shines, where it breaks
Small high-dimensional datasets
The number of parameters is roughly the number of support vectors, which is often a small fraction of the training set. SVMs are famously strong on text classification, where the feature space has tens of thousands of dimensions but the training set only tens of thousands of examples.
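As a hedged illustration of that regime (the dataset and category choices here are mine, not the article's), the classic recipe is sparse tf-idf features feeding a linear SVM:

```python
# Illustrative sketch: linear SVM on sparse, high-dimensional text features.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))
```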
Sparse, expensive labels
If labels are costly (medical, scientific) and the dataset stays under a few thousand points, SVM's margin objective often outperforms tree ensembles, which need more data.
Big data
Training cost scales between O(n²) and O(n³). At a million points the wait is hours; at ten million it's days. Linear SVM with specialized solvers (LIBLINEAR) or stochastic gradient descent extends the range, but kernel SVM hits a wall.
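A sketch of the stochastic-gradient escape hatch, assuming scikit-learn's SGDClassifier as the solver (the chunked loop below is a synthetic stand-in for streaming from disk): hinge loss plus an L2 penalty is a linear SVM trained one chunk at a time, so the data never has to fit in memory.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Linear SVM trained by SGD: hinge loss + L2 penalty, updated chunk by chunk.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4)

rng = np.random.default_rng(0)
for step in range(100):                       # stand-in for streaming chunks
    X_chunk = rng.normal(size=(1_000, 50))
    y_chunk = (X_chunk[:, 0] + X_chunk[:, 1] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=[0, 1])
```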
Multi-class natively
SVM is a binary classifier. Multi-class is faked with one-vs-rest or one-vs-one schemes that scale poorly past a few dozen classes. For problems with many classes, neural networks or tree ensembles are simpler.
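For scale, a sketch of the one-vs-rest workaround (the digits dataset is an arbitrary stand-in): k classes means k underlying binary SVMs to train and evaluate.

```python
# One-vs-rest: one binary SVM per class, so cost grows linearly with classes.
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)           # 10 classes
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10_000)).fit(X, y)
print(len(ovr.estimators_))                   # 10 fitted binary SVMs
```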
§ IV · Trade-off scorecard
- Inference: 0.65
- Accuracy: 0.80
- Training: 0.55
- Small size: 0.70
§ V · In production
Spam filtering at Gmail (early generations). Linear SVM on bag-of-words features was the workhorse of email spam classification for the better part of a decade — tens of thousands of features, billions of examples streamed through LibLinear, daily retrains. Modern Gmail uses neural networks; the SVM era is what made the modern era possible.
§ VI · Compare to
Logistic Regression
Same shape, different loss
KNN
Local, not global
Neural Network (MLP)
Non-linear · phase 3