Vol. XII · No. 05 · May 2026
Jake Cuth.

Bayes' theorem,
applied bluntly.

Apply Bayes' theorem under the assumption that features are conditionally independent. Wrong assumption, surprisingly useful answer — especially when text is the data.

Click anywhere to add a labeled point. The dashed ellipses show the 1-sigma boundary of each class's fitted Gaussian. The decision surface is wherever the two classes' posteriors are equal: a quadratic curve, not a straight line.


Each class's mean and per-axis variance are estimated from its points. Predict by comparing, per class, the summed log-likelihoods plus the log prior.

dashed ellipses = 1-sigma per class

Bayes' theorem says posterior ∝ likelihood × prior. For classification, predict whichever class has the highest posterior probability given the features. Naive Bayes makes one ruthless simplification: assume features are conditionally independent given the class. That assumption is almost always wrong — hence "naive" — but it makes the math trivial and the model fast.

For continuous features (this demo), each feature within a class is modeled as a univariate Gaussian. Multinomial Naive Bayes does the same with word counts and is famously strong on text. The two variants share the same skeleton: estimate per-class likelihoods independently and multiply them at prediction time.
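Here's a minimal NumPy sketch of the Gaussian variant: fit is just per-class means, variances, and priors, and predict compares summed log-likelihoods plus log priors. The class name, toy data, and the small variance floor are illustrative choices, not the demo's source.

```python
import numpy as np

# Minimal Gaussian Naive Bayes: one mean, one variance, and one prior per
# class; features are treated as independent given the class.
class TinyGaussianNB:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == k].mean(axis=0) for k in self.classes_])
        # 1e-9 floor keeps a one-point class from producing zero variance
        self.var_ = np.array([X[y == k].var(axis=0) for k in self.classes_]) + 1e-9
        self.log_prior_ = np.log([np.mean(y == k) for k in self.classes_])
        return self

    def predict(self, X):
        # Per class: sum of log N(x_j; mu_jk, var_jk) over features j, plus
        # the log prior -- the product rule from "The math", taken in logs.
        diff = X[None, :, :] - self.mu_[:, None, :]
        ll = -0.5 * (np.log(2 * np.pi * self.var_[:, None, :])
                     + diff ** 2 / self.var_[:, None, :]).sum(axis=2)
        return self.classes_[np.argmax(ll + self.log_prior_[:, None], axis=0)]

# Invented toy data: two blobs with different spreads.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [1, 1], (50, 2)),
               rng.normal([2, 2], [1, 2], (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(TinyGaussianNB().fit(X, y).predict(X[:5]))   # likely all class 0
```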

The naive assumption breaks reality, but its effect on argmax classification is often muted. Even when the probabilities are wildly miscalibrated, the relative ordering (which class is most likely?) is frequently correct. That's why Naive Bayes survives in spam filters, sentiment classifiers, and document categorization decades after better-calibrated models arrived.

The math

For features x = (x_1, …, x_d) and classes C_k:

P(C_k | x) ∝ P(C_k) · ∏_(j=1..d) P(x_j | C_k)

For Gaussian features, P(x_j | C_k) = N(x_j; μ_(jk), σ²_(jk)) — one mean and one variance per (feature, class) pair.
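To make the product concrete, a toy numeric check with invented parameters: two classes, two features, equal priors.

```python
from scipy.stats import norm

# Made-up per-(feature, class) means and variances, per the formula above.
prior = [0.5, 0.5]
mu    = [(0.0, 0.0), (2.0, 2.0)]
var   = [(1.0, 1.0), (1.0, 4.0)]

x = (1.0, 0.5)
post = []
for k in range(2):
    like = prior[k]                     # start from P(C_k)
    for j in range(2):
        # multiply in N(x_j; mu_jk, var_jk); scale is the std deviation
        like *= norm.pdf(x[j], loc=mu[k][j], scale=var[k][j] ** 0.5)
    post.append(like)

total = sum(post)
print([p / total for p in post])        # normalized posteriors
```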

The decision boundary in 2D is the curve where two posteriors are equal:

log [P(C_0) · P(x | C_0)] = log [P(C_1) · P(x | C_1)]

For Gaussians with different variances, the boundary is a conic section: ellipse, parabola, or hyperbola, depending on how the two classes' per-feature variances compare.
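You can watch the quadratic fall out in one dimension. Equating the two log-posteriors and collecting powers of x gives ax² + bx + c = 0; a sketch with invented parameters:

```python
import numpy as np

# Made-up 1D classes with different variances; equal priors.
mu0, var0, p0 = 0.0, 1.0, 0.5
mu1, var1, p1 = 2.0, 4.0, 0.5

# Coefficients from expanding the two Gaussian log-posteriors:
a = 0.5 * (1 / var1 - 1 / var0)
b = mu0 / var0 - mu1 / var1
c = (mu1**2 / (2 * var1) - mu0**2 / (2 * var0)
     + np.log(p0 / p1) + 0.5 * np.log(var1 / var0))

print(np.roots([a, b, c]))  # two boundary points, not one: the curve is quadratic
```

Two roots, so class 0 wins on an interval and class 1 everywhere else; with equal variances, a vanishes and the boundary collapses to a single linear threshold.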


Shines

Text classification

Word counts in documents are reasonably close to independent given the topic. Spam filtering, sentiment analysis, language detection, and document tagging all run beautifully on multinomial Naive Bayes with bag-of-words features.
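A minimal scikit-learn sketch; the four-document corpus and its labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus: bag-of-words counts feed multinomial Naive Bayes.
docs = [
    "win cash now limited offer",
    "cheap pills click here",
    "meeting moved to thursday",
    "quarterly report attached",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["click now to win cash"]))   # -> ['spam']
```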

Shines

Tiny training, fast inference

Training is one pass through the data to estimate per-class statistics. Prediction is one likelihood evaluation per class. Both are trivially parallelizable. For real-time per-document scoring at internet scale, NB is hard to beat on cost.
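A sketch of why one pass suffices: per class, training only accumulates counts, sums, and sums of squares. The helper below is hypothetical, and numerically cruder than a Welford-style update, but it shows the shape of the computation.

```python
import numpy as np

# One pass over the data: Gaussian NB training needs only per-class
# counts, feature sums, and feature sums of squares.
class ClassStats:
    def __init__(self, d):
        self.n = 0
        self.s = np.zeros(d)    # running per-feature sum
        self.ss = np.zeros(d)   # running per-feature sum of squares

    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x

    def mean_var(self):
        mean = self.s / self.n
        return mean, self.ss / self.n - mean ** 2  # MLE variance

# Stream points in one at a time; no second pass needed.
stats = ClassStats(d=2)
for point in np.array([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]):
    stats.add(point)
print(stats.mean_var())
```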

Breaks

Correlated features

The "naive" assumption hurts most when features are highly correlated. Two near-duplicate features count their evidence twice; the posterior probabilities skew dramatically even when the argmax holds. Calibration fixes the probabilities; logistic regression usually reads better.

Breaks

Continuous, non-Gaussian features

If your continuous features have heavy tails or multimodal distributions, Gaussian NB models them poorly. Bin them into categorical buckets and use multinomial NB, or switch to a tree-based model that doesn't care about distributional shape.
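A scikit-learn sketch of the binning route; the heavy-tailed data is synthetic and the bin count is arbitrary.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Synthetic heavy-tailed data: quantile-bin each feature, one-hot the bins,
# and hand the resulting counts to multinomial NB.
rng = np.random.default_rng(0)
X = rng.standard_cauchy((400, 2))          # heavy tails break Gaussian NB
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = make_pipeline(
    KBinsDiscretizer(n_bins=8, encode="onehot", strategy="quantile"),
    MultinomialNB(),
)
model.fit(X, y)
print(model.score(X, y))   # training accuracy of the binned model
```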


Scorecard
  • Inference: 0.95
  • Accuracy: 0.60
  • Training: 0.95
  • Small size: 0.95

The original Paul Graham spam filter (2002). Multinomial Naive Bayes on per-word probabilities — trained on a user's own ham and spam folders — was the algorithm that turned spam from an unsolvable mess into a routine background problem. Modern spam filters layer many models, but the foundation is still a Naive Bayes pass.

