FIG. 13 · ATLAS
Gradient Boosting
Trees, but each one fixes the last.
Add trees one at a time, each fitted to the residual error of the ensemble so far. XGBoost and LightGBM dominate Kaggle leaderboards for a reason: a sequence of weak learners, taken together, is rarely beaten on tabular data.
Move the iteration slider to watch the boundary form. Each step adds a small decision tree that fits the residuals of the cumulative prediction. Lower learning rate means slower but smoother learning. Depth controls how flexible each individual tree is.
§ I The ensemble, growing
Move iterations from 1 to 100. Each step adds one tree fitted to current residuals.
§ II How it works
Start with a constant prediction (zero, or the class mean). Compute the residual at each training point: how far off is the current prediction? Fit a small decision tree to predict those residuals. Add it to the ensemble, scaled by the learning rate. Repeat.
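A minimal sketch of that loop for regression, using scikit-learn's DecisionTreeRegressor as the weak learner. It mirrors the demo's basic algorithm rather than any production library, and the parameter names are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Plain gradient boosting for squared-error loss (illustrative sketch)."""
    f0 = y.mean()                       # start with a constant prediction
    pred = np.full(len(y), f0)          # current ensemble prediction F_m(x)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred            # how far off is the ensemble right now?
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)          # small tree fitted to the residuals
        pred += learning_rate * tree.predict(X)   # add the scaled correction
        trees.append(tree)
    return f0, trees

def predict(X, f0, trees, learning_rate=0.1):
    """Cumulative sum of the constant plus every tree's scaled output."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```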
The trick is that each tree fixes the previous ensemble's mistakes. Random Forest averages independent trees. Gradient Boosting chains them. The dependency means boosting trains slower (you can't parallelize across trees the way you can with a forest) but the sequential corrections produce tighter fits with the same total tree budget.
Modern implementations — XGBoost, LightGBM, CatBoost — add regularization, smarter split-finding, histogram-based binning, and second-order gradient information. The algorithm in this demo is the basic version; production libraries are several generations more sophisticated.
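A rough illustration of how those additions surface as knobs in the library APIs; the parameter values below are arbitrary placeholders, not tuned recommendations.

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# XGBoost: L2 penalty (reg_lambda), minimum split gain (gamma),
# and histogram-based split finding (tree_method="hist").
xgb_model = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4,
                          reg_lambda=1.0, gamma=0.5, tree_method="hist")

# LightGBM: leaf-wise growth capped by num_leaves, histogram binning via max_bin.
lgbm_model = LGBMClassifier(n_estimators=300, learning_rate=0.05,
                            num_leaves=31, max_bin=255, reg_lambda=1.0)

# Both expose the familiar scikit-learn interface:
#   xgb_model.fit(X_train, y_train); xgb_model.predict_proba(X_test)
```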
The math
For squared-error loss, given predictions F_m(x) after m iterations, the residual at each training point is

r_i^(m+1) = y_i − F_m(x_i)

Fit a regression tree h_(m+1) to those residuals, then update:

F_(m+1)(x) = F_m(x) + ν · h_(m+1)(x)

where ν is the learning rate. After M iterations, the prediction is the cumulative sum of all tree outputs scaled by ν. For classification, replace squared error with log-loss; the residuals then become gradients of the log-loss, i.e. the gap between the true label and the predicted probability.
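For the binary case, a sketch of that pseudo-residual (the negative gradient of the log-loss with respect to the raw score F), assuming 0/1 labels:

```python
import numpy as np

def logloss_residuals(y, F):
    """Pseudo-residuals for binary log-loss.

    y: 0/1 labels; F: current ensemble scores in log-odds space.
    Each boosting iteration fits the next tree to these values.
    """
    p = 1.0 / (1.0 + np.exp(-F))    # sigmoid turns scores into probabilities
    return y - p                    # shrinks toward 0 as predictions improve
```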
§ III Where it shines, where it breaks
Tabular accuracy ceiling
For mixed-type tabular data — the realm of finance, e-commerce, retention — gradient boosting is roughly the ceiling of what's achievable without deep learning. Often beats neural networks at the same problem.
Robust to dirty data
Trees handle missing values, mixed scales, and irrelevant features; the major libraries add monotonic constraints on top. XGBoost in particular has direct support for missing-value branches and feature-interaction constraints.
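A small sketch of both points with XGBoost's scikit-learn wrapper; native NaN handling and the monotone_constraints parameter are real XGBoost features, but the tiny dataset here is invented for illustration.

```python
import numpy as np
from xgboost import XGBRegressor

# Toy data with a missing value; XGBoost learns a default branch
# direction for NaNs at each split instead of requiring imputation.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 150.0],
              [4.0, 180.0]])
y = np.array([10.0, 12.0, 14.0, 18.0])

# monotone_constraints: +1 forces predictions to be non-decreasing in
# feature 0; 0 leaves feature 1 unconstrained.
model = XGBRegressor(n_estimators=50, max_depth=2,
                     monotone_constraints=(1, 0))
model.fit(X, y)
print(model.predict(X))
```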
Sequential training
Trees can't be built in parallel because each depends on the residuals from the last. LightGBM mitigates this with histogram binning, but the fundamental dependency is real. Random forests are dramatically faster to train.
Overfit risk on noisy labels
Try the noisy preset above and crank n_estimators to 100. The boundary memorizes the noise. Cross-validation, early stopping, and regularization (γ, λ in XGBoost) are not optional in production.
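A sketch of early stopping plus those penalties using XGBoost's native training API; the synthetic noisy data below is a stand-in for the demo's noisy preset.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic noisy binary labels (stand-in for the demo's noisy preset).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1,
          "gamma": 1.0, "lambda": 1.0}   # split-gain and L2 penalties

# Stop adding trees once validation loss hasn't improved for 10 rounds,
# instead of blindly running all 100 iterations.
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dval, "val")], early_stopping_rounds=10)
print("best iteration:", booster.best_iteration)
```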
§ IV Trade-off scorecard
- Inference: 0.55
- Accuracy: 0.90
- Training: 0.45
- Small size: 0.40
§ V In production
Airbnb's price recommendations. A multi-stage gradient-boosted system generates pricing suggestions for hosts across millions of listings worldwide. The model consumes hundreds of features — seasonality, neighborhood, photo quality scores, local events — and outputs a single nightly rate. XGBoost was the first algorithm to beat Airbnb's hand-tuned pricing heuristics, and it's been the workhorse since.
§ VI Compare to
Random Forest
Parallel trees · faster training
Decision Tree
Single tree · interpretable
Neural Network (MLP)
Deep learning · phase 3