Vol. XII · No. 05 · May 2026
Jake Cuth.

Bayes' theorem,
applied bluntly.

Apply Bayes' theorem under the assumption that features are conditionally independent. Wrong assumption, surprisingly useful answer — especially when text is the data.

Click anywhere to add a labeled point. The dashed ellipses show the 1-sigma boundary of each class's fitted Gaussian. The decision surface is wherever the two classes' posteriors are equal: a quadratic curve, not a straight line.


Each class's mean and per-axis variance are estimated from its points. Predict by comparing, per class, the summed log-likelihoods plus the log prior.

dashed ellipses = 1-sigma per class

Bayes' theorem says posterior ∝ likelihood × prior. For classification, predict whichever class has the highest posterior probability given the features. Naive Bayes makes one ruthless simplification: assume features are conditionally independent given the class. That assumption is almost always wrong — hence "naive" — but it makes the math trivial and the model fast.

For continuous features (this demo), each feature within a class is modeled as a univariate Gaussian. Multinomial Naive Bayes does the same with word counts and is famously strong on text. The two variants share the same skeleton: estimate per-class likelihoods independently and multiply them at prediction time.
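Here's a minimal NumPy sketch of the Gaussian variant: fit is just per-class means, variances, and priors, and predict compares summed log-likelihoods plus log priors. The class name, toy data, and the small variance floor are illustrative choices, not the demo's source.

```python
import numpy as np

# Minimal Gaussian Naive Bayes: one mean, one variance, and one prior per
# class; features are treated as independent given the class.
class TinyGaussianNB:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == k].mean(axis=0) for k in self.classes_])
        # 1e-9 floor keeps a one-point class from producing zero variance
        self.var_ = np.array([X[y == k].var(axis=0) for k in self.classes_]) + 1e-9
        self.log_prior_ = np.log([np.mean(y == k) for k in self.classes_])
        return self

    def predict(self, X):
        # Per class: sum of log N(x_j; mu_jk, var_jk) over features j, plus
        # the log prior -- the product rule from "The math", taken in logs.
        diff = X[None, :, :] - self.mu_[:, None, :]
        ll = -0.5 * (np.log(2 * np.pi * self.var_[:, None, :])
                     + diff ** 2 / self.var_[:, None, :]).sum(axis=2)
        return self.classes_[np.argmax(ll + self.log_prior_[:, None], axis=0)]

# Invented toy data: two blobs with different spreads.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [1, 1], (50, 2)),
               rng.normal([2, 2], [1, 2], (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(TinyGaussianNB().fit(X, y).predict(X[:5]))   # likely all class 0
```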

The naive assumption breaks reality, but its effect on argmax classification is often muted. Even when the probabilities are wildly miscalibrated, the relative ordering (which class is most likely?) is frequently correct. That's why Naive Bayes survives in spam filters, sentiment classifiers, and document categorization decades after better-calibrated models arrived.

The math

For features x = (x_1, …, x_d) and classes C_k:

P(C_k | x) ∝ P(C_k) · ∏_(j=1..d) P(x_j | C_k)

For Gaussian features, P(x_j | C_k) = N(x_j; μ_(jk), σ²_(jk)) — one mean and one variance per (feature, class) pair.
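To make the product concrete, a toy numeric check with invented parameters: two classes, two features, equal priors.

```python
from scipy.stats import norm

# Made-up per-(feature, class) means and variances, per the formula above.
prior = [0.5, 0.5]
mu    = [(0.0, 0.0), (2.0, 2.0)]
var   = [(1.0, 1.0), (1.0, 4.0)]

x = (1.0, 0.5)
post = []
for k in range(2):
    like = prior[k]                     # start from P(C_k)
    for j in range(2):
        # multiply in N(x_j; mu_jk, var_jk); scale is the std deviation
        like *= norm.pdf(x[j], loc=mu[k][j], scale=var[k][j] ** 0.5)
    post.append(like)

total = sum(post)
print([p / total for p in post])        # normalized posteriors
```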

The decision boundary in 2D is the curve where two posteriors are equal:

log [P(C_0) · P(x | C_0)] = log [P(C_1) · P(x | C_1)]

For Gaussians with different variances, the boundary is a conic section: ellipse, parabola, or hyperbola, depending on how the two classes' per-feature variances compare.
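You can watch the quadratic fall out in one dimension. Equating the two log-posteriors and collecting powers of x gives ax² + bx + c = 0; a sketch with invented parameters:

```python
import numpy as np

# Made-up 1D classes with different variances; equal priors.
mu0, var0, p0 = 0.0, 1.0, 0.5
mu1, var1, p1 = 2.0, 4.0, 0.5

# Coefficients from expanding the two Gaussian log-posteriors:
a = 0.5 * (1 / var1 - 1 / var0)
b = mu0 / var0 - mu1 / var1
c = (mu1**2 / (2 * var1) - mu0**2 / (2 * var0)
     + np.log(p0 / p1) + 0.5 * np.log(var1 / var0))

print(np.roots([a, b, c]))  # two boundary points, not one: the curve is quadratic
```

Two roots, so class 0 wins on an interval and class 1 everywhere else; with equal variances, a vanishes and the boundary collapses to a single linear threshold.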


Shines

Text classification

Word counts in documents are reasonably close to independent given the topic. Spam filtering, sentiment analysis, language detection, and document tagging all run beautifully on multinomial Naive Bayes with bag-of-words features.
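A minimal scikit-learn sketch; the four-document corpus and its labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus: bag-of-words counts feed multinomial Naive Bayes.
docs = [
    "win cash now limited offer",
    "cheap pills click here",
    "meeting moved to thursday",
    "quarterly report attached",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["click now to win cash"]))   # -> ['spam']
```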

Shines

Tiny training, fast inference

Training is one pass through the data to estimate per-class statistics. Prediction is one likelihood evaluation per class. Both are trivially parallelizable. For real-time per-document scoring at internet scale, NB is hard to beat on cost.
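A sketch of why one pass suffices: per class, training only accumulates counts, sums, and sums of squares. The helper below is hypothetical, and numerically cruder than a Welford-style update, but it shows the shape of the computation.

```python
import numpy as np

# One pass over the data: Gaussian NB training needs only per-class
# counts, feature sums, and feature sums of squares.
class ClassStats:
    def __init__(self, d):
        self.n = 0
        self.s = np.zeros(d)    # running per-feature sum
        self.ss = np.zeros(d)   # running per-feature sum of squares

    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x

    def mean_var(self):
        mean = self.s / self.n
        return mean, self.ss / self.n - mean ** 2  # MLE variance

# Stream points in one at a time; no second pass needed.
stats = ClassStats(d=2)
for point in np.array([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]):
    stats.add(point)
print(stats.mean_var())
```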

Breaks

Correlated features

The "naive" assumption hurts most when features are highly correlated. Two near-duplicate features count their evidence twice; the posterior probabilities skew dramatically even when the argmax holds. Calibration fixes the probabilities; logistic regression usually reads better.

Breaks

Continuous, non-Gaussian features

If your continuous features have heavy tails or multimodal distributions, Gaussian NB models them poorly. Bin them into categorical buckets and use multinomial NB, or switch to a tree-based model that doesn't care about distributional shape.
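A scikit-learn sketch of the binning route; the heavy-tailed data is synthetic and the bin count is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Synthetic heavy-tailed data: quantile-bin each feature, one-hot the bins,
# and hand the resulting counts to multinomial NB.
rng = np.random.default_rng(0)
X = rng.standard_cauchy((400, 2))          # heavy tails break Gaussian NB
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = make_pipeline(
    KBinsDiscretizer(n_bins=8, encode="onehot", strategy="quantile"),
    MultinomialNB(),
)
model.fit(X, y)
print(model.score(X, y))   # training accuracy of the binned model
```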


Scorecard
  • Inference: 0.95
  • Accuracy: 0.60
  • Training: 0.95
  • Small size: 0.95

The original Paul Graham spam filter (2002). Multinomial Naive Bayes on per-word probabilities — trained on a user's own ham and spam folders — was the algorithm that turned spam from an unsolvable mess into a routine background problem. Modern spam filters layer many models, but the foundation is still a Naive Bayes pass.

