FIG. 04 Experiments, run honestly
When A/B tests lie.
A live A/B test simulator. Design an experiment, watch two Beta posteriors update frame-by-frame as users flow in, and see exactly how often peeking early turns noise into a "winner." It shows when the winning version is actually winning, when it's just noise, and when to stop a losing experiment before it costs a full quarter. All math runs in your browser. Nothing phoned home, no preloaded answers.
§ I Why most A/B tests are run wrong
Most failed experiments aren't failed by the data — they're failed by the decision. Tests get peeked at every morning, winners called the first time p < 0.05 flickers on the dashboard, and null results quietly buried because "we didn't get enough traffic."
The page below runs an honest test against itself. Set a baseline rate, set the true lift (or zero, if you're curious), and let the users flow. Two Beta posteriors update in real time. A pair of decision panels, one frequentist and one Bayesian, tells you what each framework would do right now. And a separate panel further down proves, in a thousand synthetic tests, what peeking actually costs.
All simulations are reproducible from a seed. A companion script runs the same peeking Monte Carlo in Python + scipy; its reference number is shown alongside the live number as a sanity check.
§ II · FIG. 04.1 The simulator — run an experiment
§ III How it works
Beta-Bernoulli conjugacy
Start with a uniform Beta(1, 1) prior on each variant's conversion rate. Each user is a Bernoulli trial: converted (+1 to α) or didn't (+1 to β). The posterior stays Beta at every step — closed-form, no numerical integration. That's why the distributions update smoothly at sixty frames per second.
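In code, the whole update is two counters. A minimal Python sketch (the page itself runs in JavaScript; names and the simulated 5% rate below are illustrative, not the page's code):

import numpy as np

def update_posterior(alpha, beta, converted):
    # Conjugate Beta-Bernoulli update for a single user.
    if converted:
        return alpha + 1, beta      # conversion bumps alpha
    return alpha, beta + 1          # non-conversion bumps beta

rng = np.random.default_rng(seed=42)
alpha_a, beta_a = 1, 1              # uniform Beta(1, 1) prior
for converted in rng.random(500) < 0.05:   # 500 users at a 5% true rate
    alpha_a, beta_a = update_posterior(alpha_a, beta_a, converted)

print(alpha_a / (alpha_a + beta_a))  # posterior mean of the conversion rate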
Sample size, solved
The required-sample readout in the designer panel uses the two-proportion z-test formula. It's what a traditional power calculator would hand you, given baseline, minimum detectable effect, α, and 1 − β. Change any slider; the number updates instantly. Run fewer trials than that and your "null result" is almost certainly under-powered noise.
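The arithmetic, sketched in Python with scipy; the designer panel does the equivalent in JavaScript, and the function name and defaults here are illustrative:

import math
from scipy.stats import norm

def required_n_per_variant(baseline, mde, alpha=0.05, power=0.80):
    # Per-variant sample size for a two-sided two-proportion z-test.
    # baseline: control conversion rate; mde: absolute minimum detectable effect.
    p2 = baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# 5% baseline, +1 point absolute lift, alpha = 0.05, power = 0.8:
# roughly 8,000 users per variant.
print(required_n_per_variant(0.05, 0.01))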
P(B > A), honestly
For the Bayesian side, we draw 5,000 samples from each Beta posterior every frame and count the fraction where B's draw exceeds A's. Takes ~2ms. Gives a probability that doesn't require a p-hacked threshold to interpret — "the test has P(B > A) = 0.91" is a statement a product manager can actually act on.
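Sketched in Python with the same 5,000 draws (the page does this in JavaScript each frame; names and the worked example are illustrative):

import numpy as np

def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b, draws=5000, seed=None):
    # Monte Carlo estimate of P(rate_B > rate_A) from the two Beta posteriors.
    rng = np.random.default_rng(seed)
    samples_a = rng.beta(alpha_a, beta_a, draws)
    samples_b = rng.beta(alpha_b, beta_b, draws)
    return float((samples_b > samples_a).mean())

# A: 50 conversions in 1,000 users; B: 60 in 1,000, on top of Beta(1, 1) priors.
print(prob_b_beats_a(1 + 50, 1 + 950, 1 + 60, 1 + 940))   # roughly 0.84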
§ IV · FIG. 04.2 The peeking-bias Monte Carlo
Set the true variant rate equal to the baseline (no real lift). Then peek at the test every few dozen users and stop the moment p < 0.05 flickers. Do that a thousand times. Count how often you declared a winner.
Nominal false-positive rate is 5%. What you'll actually get is closer to 30–40% — the inflation comes entirely from the decision process, not the data.
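A Python sketch in the spirit of the companion script, not a copy of it; the peek interval, horizon, and simulation count below are assumptions, not the panel's exact settings:

import numpy as np
from scipy.stats import norm

def two_prop_p_value(conv_a, n_a, conv_b, n_b):
    # Two-sided two-proportion z-test p-value on the running counts.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

def peeking_false_positive_rate(rate=0.05, n_max=5000, peek_every=50,
                                n_sims=1000, alpha=0.05, seed=0):
    # A/A tests: both variants share the same true rate, so every
    # declared "winner" is a false positive. Stop at the first peek
    # where p < alpha.
    rng = np.random.default_rng(seed)
    winners = 0
    for _ in range(n_sims):
        conv_a = np.cumsum(rng.random(n_max) < rate)
        conv_b = np.cumsum(rng.random(n_max) < rate)
        for n in range(peek_every, n_max + 1, peek_every):
            if two_prop_p_value(conv_a[n - 1], n, conv_b[n - 1], n) < alpha:
                winners += 1
                break
    return winners / n_sims

print(peeking_false_positive_rate())   # well above the nominal 5%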
§ V · FIG. 04.3 Frequentist vs. Bayesian, two verdicts, one test
Same data, two philosophies. They can disagree. That's the point.
The frequentist stops when p < α at the pre-registered sample size. The Bayesian stops when P(B > A) crosses a fixed threshold (default 95%). In clear-cut tests they usually agree. In ambiguous regions they don't, which is where the page earns its keep.
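The two rules are a handful of comparisons. An illustrative sketch, not the page's code, using the defaults the text mentions:

def verdicts(p_value, p_b_beats_a, n_per_variant, n_required,
             alpha=0.05, bayes_threshold=0.95):
    # Frequentist rule: judge only at the pre-registered sample size.
    if n_per_variant < n_required:
        freq = "keep running"
    elif p_value < alpha:
        freq = "significant difference"
    else:
        freq = "no significant difference"

    # Bayesian rule: stop as soon as either variant clears the threshold.
    if p_b_beats_a >= bayes_threshold:
        bayes = "B wins"
    elif p_b_beats_a <= 1 - bayes_threshold:
        bayes = "A wins"
    else:
        bayes = "keep running"

    return freq, bayes

# Same data, two verdicts: p just misses 0.05 while P(B > A) clears 0.95.
print(verdicts(p_value=0.08, p_b_beats_a=0.96,
               n_per_variant=9000, n_required=8200))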
§ VI Receipts
§ VII Methodology & Colophon
Pure JavaScript. Each frame generates a batch of Bernoulli trials (rate scaled by the Speed dial), updates the two Beta posteriors, recomputes the running z-test p-value and a Monte Carlo estimate of P(B > A). Sixty frames per second on any device since the iPhone 8.
notebooks/ab_test_model.py ↗ runs the same peeking Monte Carlo in Python + scipy, writes the reference numbers to methodology.json. The live page fetches them and prints both side by side.
Evan Miller · How Not to Run an A/B Test ↗
Kohavi et al. · Online Controlled Experiments at Large Scale ↗
Stucchio · Bayesian A/B Testing at VWO ↗
One metric, two variants — no multi-arm or multi-metric tests. No sequential-testing adjustments (SPRT, α-spending); they'd be the honest answer to peeking, but the point of the Monte Carlo panel is to show the problem, not paper over it. Fixed uniform prior; real programs use empirical priors fit to historical tests.