Vol. XII · No. 05 · May 2026
Jake Cuth.

Why my A/B simulator says false positives are 26%, not 5%.


The textbook says α = 0.05 means a 5% false-positive rate. The simulator on my A/B test lab says 26%. The lab is right and the textbook is right; they are measuring two different things.

The lab ran 1,000 simulated experiments with the true effect set to zero: A and B converted at the same rate. Each experiment peeked every 50 users and stopped the moment a two-proportion z-test crossed p < 0.05. Ten thousand users per arm at most, then call it a day. Twenty-six percent of the experiments declared a winner. The headline number is 5.2× the nominal α, and it is the lesson the lab was built to teach.
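In code, the protocol looks roughly like this. It is a minimal sketch, not the repo's code verbatim; the function name matches the notebook, but the 5% baseline conversion rate is illustrative, since the lab's actual base rate isn't stated here.

import numpy as np
from scipy.stats import norm

def peek_until_significant(rng, base_rate=0.05, batch=50, max_n=10_000, alpha=0.05):
    """Simulate one A/A experiment, peeking after every batch of users per arm.

    Returns True if any peek crosses p < alpha, i.e. a false positive."""
    a = b = n = 0
    while n < max_n:
        n += batch
        a += rng.binomial(batch, base_rate)   # arm A conversions so far
        b += rng.binomial(batch, base_rate)   # arm B conversions so far
        pooled = (a + b) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if se == 0:
            continue                          # no conversions yet, nothing to test
        z = (a / n - b / n) / se
        if 2 * norm.sf(abs(z)) < alpha:       # two-sided two-proportion z-test
            return True                       # stopped early, declared a "winner"
    return False

rng = np.random.default_rng(0)
fpr = sum(peek_until_significant(rng) for _ in range(1_000)) / 1_000
print(f"empirical FPR: {fpr:.1%}")            # lands in the mid-20s, not 5%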

The textbook 5% applies to one test, run once, at the end of the experiment. Peeking is many tests on growing samples, and under H0 the p-value is uniform at every single peek, so the chance of ever crossing 0.05 grows with each look. With at most 200 peeks you would naively expect near-certainty of an α-level crossing under pure noise; in practice successive peeks are strongly correlated (each shares most of its data with the last), which softens that to the 25–30% range across most reasonable peek schedules.
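The naive bound is two lines of arithmetic, assuming (wrongly) that the 200 peeks are independent:

alpha, k = 0.05, 200
print(1 - (1 - alpha) ** k)   # ≈ 0.99996: a false positive is all but guaranteed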

The fix practitioners reach for first, a Bonferroni correction across peeks, works but is wasteful: it splits the α budget across all 200 potential looks (0.05/200 = 0.00025 per test), most of which will never run, so every individual look is nearly powerless. The cleaner fixes are sequential designs (mSPRT, group sequential boundaries à la O'Brien-Fleming), or, and this is the part the lab repeats, commit to a sample size up front and only test once.
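The test-once version, in the same simulated world (again a sketch, with the same illustrative 5% base rate):

import numpy as np
from scipy.stats import norm

def test_once(rng, base_rate=0.05, n=10_000, alpha=0.05):
    """Fix n per arm up front, look exactly once at the end."""
    a = rng.binomial(n, base_rate)
    b = rng.binomial(n, base_rate)
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (a / n - b / n) / se
    return 2 * norm.sf(abs(z)) < alpha

rng = np.random.default_rng(0)
fpr = sum(test_once(rng) for _ in range(1_000)) / 1_000
print(f"empirical FPR: {fpr:.1%}")   # hugs the nominal 5%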

The lab also runs the same protocol across 50 random seeds, because A/B testing is an exercise in variance and reporting a single seed would defeat the lesson. The 5–95% band on empirical FPR across those seeds usually lands between 23% and 30%. The point estimate on the page is one slice of that distribution.
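The ensemble is a loop over seeds around the sketch above; the name matches the notebook's function, but the body here is again illustrative:

import numpy as np

def multi_seed_ensemble(n_seeds=50, n_experiments=1_000):
    """Empirical FPR per seed, plus the 5–95% band across seeds.

    Reuses peek_until_significant() from the sketch above."""
    fprs = [
        sum(peek_until_significant(np.random.default_rng(seed))
            for _ in range(n_experiments)) / n_experiments
        for seed in range(n_seeds)
    ]
    return np.percentile(fprs, [5, 95])

print(multi_seed_ensemble())   # roughly [0.23, 0.30]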

If you want to see the math: the simulator is at /work/ab-test-lab/ and the Python that produced the reference numbers is notebooks/ab_test_model.py. The peeking Monte Carlo lives in peek_until_significant() and the multi-seed band lives in multi_seed_ensemble().

Stopping at the first significant peek is a lie. The lab proves it twenty-six times out of a hundred.

A/B testing · Monte Carlo · stats