Vol. XII · No. 05 · May 2026
Jake Cuth.

Linear, but
disciplined.

Linear models with an L2 (ridge) or L1 (lasso) penalty on the coefficients. Lasso zeroes out unhelpful features automatically — feature selection as a side effect of the loss function.

Below: a degree-7 polynomial fitted to a wiggly dataset. Move lambda to crank up regularization. Toggle between Ridge (which shrinks every coefficient toward zero smoothly) and Lasso (which pushes some all the way to zero). Watch the curve straighten and the bars collapse.


Bars on the right show coefficient magnitude. Lasso at high lambda zeroes most of them — the useful ones survive.

degree-7 polynomial fit
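Not a replacement for the slider, but the same setup is a few lines of scikit-learn. The sketch below assumes a synthetic wiggly dataset and hand-picked alpha values (sklearn's name for lambda), purely for illustration:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso

# Placeholder "wiggly" data; the demo's own dataset will differ.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = np.sin(2 * x).ravel() + 0.3 * rng.standard_normal(60)

def coefs(model):
    # Degree-7 polynomial expansion, standardized so the penalty
    # treats every power of x on the same scale.
    pipe = make_pipeline(PolynomialFeatures(degree=7, include_bias=False),
                         StandardScaler(), model)
    pipe.fit(x, y)
    return pipe[-1].coef_

print("ridge:", np.round(coefs(Ridge(alpha=1.0)), 3))
print("lasso:", np.round(coefs(Lasso(alpha=0.1, max_iter=10_000)), 3))

Printing the two coefficient vectors is the text version of the bars: the ridge entries are small but nonzero, the lasso entries include exact zeros.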

Ridge adds λ·Σwⱼ² to the loss; Lasso adds λ·Σ|wⱼ|. The difference matters more than it looks. The squared term has a continuous derivative and shrinks all coefficients smoothly. The absolute-value term has a kink at zero that — combined with the optimization — pushes coefficients exactly to zero rather than near-zero.

That zero-pinning is automatic feature selection. Lasso is the reason a model fitted to thousands of features can come out with a sparse explanation that uses only twenty. Ridge can't do that; every coefficient stays nonzero, just smaller.

Elastic Net combines both penalties. scikit-learn's ElasticNet defaults to an even mix of the two (l1_ratio=0.5), and it's what most production teams reach for when they want grouped feature selection (Lasso alone is unstable when features are correlated).
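A minimal sketch of the combined penalty, on synthetic data and with illustrative alpha and l1_ratio values rather than recommendations:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data purely for illustration.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

# l1_ratio blends the penalties: 1.0 is pure Lasso, 0.0 is pure Ridge;
# scikit-learn's default is the even mix (l1_ratio=0.5).
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(model.coef_), "of", X.shape[1])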

The math

Ridge:

min_w Σ_i (y_i − w·x_i)² + λ Σ_j w_j²

Lasso:

min_w Σ_i (y_i − w·x_i)² + λ Σ_j |w_j|

Ridge has a closed form: ŵ = (XᵀX + λI)⁻¹ Xᵀy. Lasso doesn't — the absolute-value term breaks differentiability at zero. Solved with coordinate descent or proximal gradient methods (this demo uses soft-thresholded gradient steps).
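For the curious, here is a bare-NumPy sketch of both solvers under the objectives above: the ridge closed form verbatim, and an ISTA-style soft-thresholded gradient loop for lasso. The iteration count is illustrative; a real implementation would check convergence.

import numpy as np

def ridge_closed_form(X, y, lam):
    # w_hat = (X^T X + lam * I)^(-1) X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def lasso_ista(X, y, lam, n_iter=1000):
    # Proximal gradient on  sum_i (y_i - w.x_i)^2 + lam * sum_j |w_j|:
    # take a plain gradient step on the squared error, then soft-threshold
    # every coefficient toward zero. The thresholding is what pins
    # coefficients exactly at zero.
    w = np.zeros(X.shape[1])
    L = 2.0 * np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth gradient
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)         # gradient of the squared-error term
        z = w - grad / L                       # plain gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return w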


Shines

High-dimensional data

Genomics, text, images-as-features — whenever you have many more candidate predictors than samples, regularization is essential. Lasso-based feature selection turns "thousands of correlated genes" into "the eight that drive the outcome."
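A hedged sketch of that behavior on synthetic data; the eight informative features and the alpha value are illustrative choices, not a recipe:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 2,000 candidate predictors, 100 samples, only 8 features actually matter.
X, y = make_regression(n_samples=100, n_features=2000, n_informative=8,
                       noise=1.0, random_state=0)
model = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)
survivors = np.flatnonzero(model.coef_)
print(len(survivors), "of", X.shape[1], "features kept:", survivors)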

Shines

Multicollinearity

When features are highly correlated, OLS coefficients explode. Ridge stabilizes them by accepting a little bias in exchange for a large drop in variance. The model becomes less interpretable per-coefficient, but the predictions become trustworthy.
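A small sketch of the effect, assuming two near-duplicate features:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
x2 = x1 + 0.01 * rng.standard_normal(200)     # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.standard_normal(200)

print(LinearRegression().fit(X, y).coef_)     # typically large, offsetting values
print(Ridge(alpha=1.0).fit(X, y).coef_)       # roughly even shares summing to about 3

OLS splits the true coefficient of 3 between the twins almost arbitrarily; ridge pulls them toward an even share.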

Breaks

Non-linear relationships

Ridge / Lasso are still linear models. Polynomial features (as in this demo) extend their reach but blow up in high dimensions. For genuine non-linearity, switch to trees, kernels, or neural networks.

Breaks

Lambda is a guess

The "right" lambda depends on your data and your loss surface. Cross-validation finds a value that minimizes held-out error. Without CV you're picking lambda by feel, which usually means a penalty that's too small and a model that overfits.
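In scikit-learn that search is one estimator; the grid below is an illustrative log-spaced sweep (RidgeCV works the same way for the squared penalty):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data purely for illustration.
X, y = make_regression(n_samples=300, n_features=40, n_informative=10,
                       noise=10.0, random_state=0)

# 5-fold CV over a log-spaced grid of penalties; alpha_ is the winner.
model = LassoCV(alphas=np.logspace(-3, 2, 50), cv=5).fit(X, y)
print("chosen lambda (alpha):", model.alpha_)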


  • Inference: 0.95
  • Accuracy: 0.65
  • Training: 0.90
  • Small size: 0.95

Insurance underwriting at GEICO and Progressive. Generalized Linear Models with elastic net regularization underwrite auto policies for hundreds of millions of risk-rated drivers. The L1 component performs automatic actuarial feature selection across thousands of candidate predictors; the L2 component stabilizes coefficients across correlated demographic and driving-history features. Black-box gradient-boosted models score better, but the regulatory-friendly elastic net is what ships.

