How to use Statistical Interpretation
Statistical interpretation is the judgment layer on top of A/B test math — knowing when a winner is real, when it's noise, and when to keep the test running.
Statistical Interpretation
The practice of reading A/B test results correctly — separating real lift from random variance and deciding when to act.
Statistical interpretation sits on top of the underlying statistical method. The math gives you a p-value, a confidence interval, and a probability-to-be-best. Interpretation decides what those numbers mean for your store, your roadmap, and the next test you queue up.
It's the part of experimentation that tools can't fully automate. Two analysts can look at the same dashboard, see a 6% lift at 92% confidence on day five, and reach opposite conclusions — one ships, one keeps running. The methodology is identical; the judgment is not. Reading results well is the difference between a CRO program that compounds and one that quietly accumulates false positives.
Most experimentation horror stories — the redesign that won the test and lost revenue, the checkout tweak that flipped sign in week two — aren't methodology failures. The stats engine did its job. Someone read the output wrong.
This guide covers the four interpretation skills that matter most: telling signal from noise, deciding when to call a test, reading segment-level results without fooling yourself, and recognising the misreads that keep showing up in post-mortems.
Signal vs noise: what a confidence number actually tells you
A 95% confidence level does not mean "there's a 95% chance the variant wins." It means that if there were truly no difference between control and variant, you'd see a result this extreme or more in only 5% of repeated experiments. That's a subtle but important distinction.
The practical implication: confidence rises and falls during a test, often dramatically. A variant can sit at 97% on day three, drop to 78% on day six, and settle at 91% by day fourteen. None of that is broken — it's the regular behaviour of a metric whose variance shrinks as the sample grows. Confidence on day three is mostly telling you about day three's buyers.
What you should look at alongside the headline number: the confidence interval width, the absolute lift, and the sample size relative to your pre-test power calculation. A 12% lift at 95% confidence on 800 conversions is a very different artefact from a 1.2% lift at 95% confidence on 80,000 conversions, even though both "pass."
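As a rough illustration of why two results that both "pass" can be very different artefacts, the sketch below computes the relative lift, an approximate 95% interval, and a two-sided p-value for two tests. The visitor and conversion counts are hypothetical and chosen only to show how interval width changes with sample size. `summarize_lift` is an illustrative helper, not any tool's API, and it uses a plain normal approximation.

```python
from math import sqrt
from statistics import NormalDist
from typing import NamedTuple


class LiftSummary(NamedTuple):
    relative_lift: float  # (variant rate - control rate) / control rate
    ci_low: float         # lower bound of the relative-lift interval
    ci_high: float        # upper bound of the relative-lift interval
    p_value: float        # two-sided p-value from a pooled z-test


def summarize_lift(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Normal-approximation summary of a two-arm conversion test."""
    rate_c, rate_v = conv_c / n_c, conv_v / n_v
    diff = rate_v - rate_c

    # Unpooled standard error for the interval around the absolute difference.
    se = sqrt(rate_c * (1 - rate_c) / n_c + rate_v * (1 - rate_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    # Pooled standard error for the test of "no difference".
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se_pooled))

    # Simplification: express the interval relative to the observed control
    # rate, treating that rate as fixed.
    return LiftSummary(
        diff / rate_c,
        (diff - z_crit * se) / rate_c,
        (diff + z_crit * se) / rate_c,
        p_value,
    )


# Hypothetical counts: a small test with a big lift, a large test with a small one.
small = summarize_lift(conv_c=400, n_c=10_000, conv_v=460, n_v=10_000)
large = summarize_lift(conv_c=40_000, n_c=1_000_000, conv_v=40_600, n_v=1_000_000)

for name, result in (("small", small), ("large", large)):
    print(f"{name}: lift {result.relative_lift:+.1%}, "
          f"CI [{result.ci_low:+.1%}, {result.ci_high:+.1%}], p={result.p_value:.3f}")
```

Both hypothetical tests clear the 95% bar, but the small one leaves a far wider range of plausible lifts, which is why the interval width belongs next to the headline number.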
Peeking is the silent killer
Checking a test daily and stopping the moment it hits 95% confidence inflates your false-positive rate far above the nominal 5%: roughly 20-30% for a test checked every day over two to four weeks, and higher the more often you look. If you're going to look every day, use a sequential testing method designed for continuous monitoring (mSPRT, or a Bayesian approach with a pre-set stopping rule), or commit upfront to a fixed sample size and ignore the dashboard until you hit it.
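A small simulation makes the peeking penalty concrete. The sketch below models an A/A test (no true difference) as a running z-statistic that is checked after each of `looks` equal-sized batches and stopped at the first reading beyond the 95% threshold. The model assumes equal daily traffic and a normal approximation, and the function name and parameters are illustrative, so treat the output as indicative rather than exact.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(looks, alpha=0.05, n_sims=20_000, seed=0):
    """Share of A/A tests wrongly called significant when the result is
    checked after every batch and the test stops at the first 'win'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for k in range(1, looks + 1):
            # Under the null, each batch adds an independent N(0, 1) increment,
            # so the z-statistic after k batches is the running sum / sqrt(k).
            cumulative += rng.gauss(0.0, 1.0)
            if abs(cumulative / sqrt(k)) > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims


for looks in (1, 7, 14, 28):
    rate = peeking_false_positive_rate(looks)
    print(f"{looks:>2} looks: false-positive rate ~ {rate:.1%}")
```

With a single look the rate stays near the nominal 5%. It climbs steadily as the number of looks grows, so the exact inflation depends on how often you check and for how long.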
When to call a test (and when to keep waiting)
The honest answer: call it when you've hit the sample size and duration you committed to before the test started. That sounds boring because the interesting decisions (should I stop early, should I extend, is this enough) are exactly the ones that introduce bias.
Two duration rules worth respecting. First, run for a minimum of one full business cycle — for most stores that's at least one complete week, ideally two, so that Tuesday traffic and Sunday traffic both get represented in both arms. Second, run long enough to cover the buying cycle of your category. Apparel converts in hours; furniture converts in weeks.
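One way to turn those commitments into numbers before launch is a standard two-proportion power calculation, rounded up to whole weeks. This is a minimal sketch: the baseline rate, minimum detectable effect, and traffic figures are hypothetical, the function names are illustrative, and `min_weeks` is simply where a longer buying cycle would be enforced.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed in each arm to detect `relative_mde` with a two-sided
    z-test on conversion rate (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def planned_duration_days(n_per_arm, daily_visitors_per_arm, min_weeks=1):
    """Days to run, rounded up to whole weeks so every weekday is represented
    equally, and never shorter than `min_weeks`."""
    days_for_sample = ceil(n_per_arm / daily_visitors_per_arm)
    weeks = max(min_weeks, ceil(days_for_sample / 7))
    return weeks * 7


# Hypothetical store: 3% baseline conversion, 10% relative MDE, 2,500 visitors per arm per day.
n = sample_size_per_arm(baseline_rate=0.03, relative_mde=0.10)
days = planned_duration_days(n, daily_visitors_per_arm=2_500, min_weeks=2)
print(f"{n} visitors per arm, run for {days} days")
```

Writing both numbers down before launch gives you the stopping rule, so the dashboard cannot talk you into a different one.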
Figure: How confidence drifts during a typical test
Notice how the day-four spike would have triggered a "call it" in a peeking workflow — and then drifted back down. That's the exact pattern that creates winning tests in your dashboard that fail to replicate in production. The variant probably is winning, but the size of the win is much smaller than day four suggested.
Segments, guardrails, and the multiple-comparisons trap
Segment-level reads are where good interpretation earns its keep. The overall test may be flat, but mobile-on-iOS converts 14% better in the variant. Tempting headline. Almost always wrong as a standalone conclusion.
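If you slice an experiment ten ways and there is no real effect anywhere, there is roughly a 40% chance that at least one segment looks "significant" at 95% through noise alone (1 - 0.95^10 ≈ 0.40, assuming independent slices). At twenty slices you'd expect one false positive on average. That's the multiple-comparisons problem. The defensible move: treat segment findings as hypotheses for the next test, not as conclusions to ship. Use them to shape your roadmap, not your release notes.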
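To put numbers on the multiple-comparisons problem, the sketch below computes the chance of at least one false positive across independent segments, then applies Holm's step-down correction to a set of segment p-values. Holm's method is one common correction and is used here as an illustration, not as something the guide prescribes. The p-values are hypothetical.

```python
def familywise_false_positive_rate(n_segments, alpha=0.05):
    """Chance that at least one of `n_segments` independent null comparisons
    looks significant at level `alpha`."""
    return 1 - (1 - alpha) ** n_segments


def holm_significant(p_values, alpha=0.05):
    """Holm's step-down procedure: returns one boolean per p-value, in input
    order, marking which segments stay significant after correction."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with
        # alpha/(m-1), and so on. Stop at the first one that fails.
        if p_values[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break
    return significant


print(f"10 segments: {familywise_false_positive_rate(10):.0%} chance of a spurious 'win'")

# Hypothetical p-values from ten segment slices of one test.
segment_p_values = [0.03, 0.004, 0.20, 0.50, 0.60, 0.70, 0.80, 0.90, 0.35, 0.41]
print(holm_significant(segment_p_values))
```

In this hypothetical set, the segment at p = 0.03 would pass an uncorrected 95% check but does not survive the correction, which is the kind of result better carried into the next test as a hypothesis.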
How to weight different signals from a finished test
| Signal | What it tells you | Confidence to act |
|---|---|---|
| Primary metric, pre-declared, hits target sample | Real effect, sized correctly | High — ship it |
| Primary metric, significant but underpowered | Effect probably exists, size unreliable | Medium — extend or replicate |
| Secondary metric moves, primary flat | Possibly real, possibly noise | Low — hypothesis for next test |
| One segment wins, overall flat | Likely multiple-comparisons artefact | Low — investigate, don't ship |
| Guardrail metric (revenue, AOV) drops | Variant has a hidden cost | Critical — block ship regardless of primary |
| Effect reverses week over week | Novelty or selection bias | Low — keep running |
Guardrail metrics deserve special attention. A checkout test can lift conversion rate 4% and drop average order value 7% — net negative revenue, but the headline number looks great. Decide your guardrails before you start (revenue per visitor, refund rate, page speed) and let them veto wins.
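The arithmetic behind that example is short enough to check directly. Assuming revenue per visitor is conversion rate times average order value, the relative changes multiply rather than add. The function name below is illustrative.

```python
def revenue_per_visitor_change(conversion_lift, aov_change):
    """Relative change in revenue per visitor, given relative changes in
    conversion rate and average order value (RPV = conversion rate x AOV)."""
    return (1 + conversion_lift) * (1 + aov_change) - 1


# The checkout example: conversion up 4%, average order value down 7%.
change = revenue_per_visitor_change(0.04, -0.07)
print(f"Revenue per visitor: {change:+.2%}")  # Revenue per visitor: -3.28%
```

A headline 4% conversion win turns into roughly a 3.3% loss in revenue per visitor, which is why the guardrail needs the power to veto.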
The misreads that show up in every post-mortem
Three patterns account for most interpretation failures. First, confusing statistical significance with practical significance: a 0.4% lift at 99% confidence is probably real but might not be worth the engineering cost. Second, ignoring the confidence interval: a result of "+8% lift, 95% CI [+1%, +15%]" is much shakier than the point estimate suggests, because the true lift could plausibly be as small as 1%. Third, single-test thinking: one experiment is a data point, not a finding.
Replication is the underrated discipline. If a win matters — revenue impact above your decision threshold, a permanent platform change — re-run it. The second test will almost always show a smaller effect than the first (regression to the mean) and that smaller number is the one you should plan against. This is the part of experiment analysis that separates teams who learn from teams who accumulate dashboards.
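The shrinkage on replication can be reproduced with a toy simulation. The sketch below assumes a hypothetical true lift and standard error, runs many noisy first tests, keeps only those that came out as significant winners, and then replicates each winner once. All names and numbers are illustrative.

```python
import random
from statistics import mean


def winners_vs_replications(true_lift=0.02, standard_error=0.015,
                            n_tests=20_000, seed=0):
    """Average observed lift among first-run 'winners' versus the average
    lift seen when those same winners are re-run once."""
    rng = random.Random(seed)
    first_runs, replications = [], []
    for _ in range(n_tests):
        observed = rng.gauss(true_lift, standard_error)
        # Count it as a winner only if it clears the 95% bar in the positive direction.
        if observed / standard_error > 1.96:
            first_runs.append(observed)
            # The replication draws fresh noise around the same true lift.
            replications.append(rng.gauss(true_lift, standard_error))
    return mean(first_runs), mean(replications)


first, second = winners_vs_replications()
print(f"winning first runs: {first:+.1%} on average, replications: {second:+.1%}")
```

Because only the lucky first runs clear the bar, their average overstates the true lift. The replications are not filtered by luck, so they land near the true value, which is the number to plan against.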
Write the decision before you read the result
Before unblinding a test, write down what you'd do at each outcome: "If primary lifts 2%+ with no guardrail regression, ship. If it lifts but a guardrail drops, hold and investigate. If it's flat, archive the hypothesis." Pre-committing prevents the most common interpretation failure, which is letting the result you wanted shape the conclusion you draw.
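One lightweight way to pre-commit is to encode the plan as a function before the results are visible. The sketch below mirrors the three outcomes quoted above. The guardrail tolerance and the treatment of a lift below the threshold as "flat" are assumptions added for illustration, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionPlan:
    min_primary_lift: float  # e.g. 0.02 for "primary lifts 2%+"
    guardrail_floor: float   # assumed tolerance, e.g. -0.01 allows a 1% drop


def decide(plan, primary_lift, primary_significant, guardrail_changes):
    """Apply a pre-written plan to unblinded results.

    `guardrail_changes` maps a guardrail name to its relative change,
    for example {"revenue_per_visitor": -0.03}.
    """
    lifted = primary_significant and primary_lift >= plan.min_primary_lift
    regressed = any(change < plan.guardrail_floor
                    for change in guardrail_changes.values())
    if lifted and not regressed:
        return "ship"
    if lifted and regressed:
        return "hold and investigate"
    # Assumption: anything short of the committed lift is treated as flat.
    return "archive the hypothesis"


plan = DecisionPlan(min_primary_lift=0.02, guardrail_floor=-0.01)
print(decide(plan, 0.04, True, {"revenue_per_visitor": -0.03}))  # hold and investigate
```

The value is less in the code than in the timestamp: the thresholds exist before anyone knows which outcome they favour.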
Frequently asked questions
What confidence level should I use?
95% is the conventional bar and works well for most product and copy tests. Drop to 90% for low-risk surface changes where you're happy with faster iteration. Raise to 99% only for changes that are expensive to roll back: checkout flow, pricing, anything touching tax or shipping logic.
Can I stop a test early once it reaches significance?
Not with a fixed-horizon (frequentist) test. Stopping early when you happen to peek at a significant moment inflates your false-positive rate well above the nominal 5%. If you need early-stopping flexibility, run a sequential or Bayesian test designed for it. Otherwise, commit to your sample size upfront.
How long should I run a test?
Minimum two full weeks for most tests, regardless of sample size, so you cover weekday and weekend traffic plus any weekly email cycles. Run longer if your buying cycle is slow: furniture, considered electronics, and B2B-adjacent SKUs may need three to four weeks to capture realistic decision windows.
How do I read a confidence interval?
A confidence interval gives you the plausible range of the true effect, for example "lift is between +1% and +9%." That range tells you both whether the effect is distinguishable from zero (does the interval exclude zero?) and how precisely you've measured it. A narrow interval around a small lift is more actionable than a wide interval around a big one.
What if a secondary metric moves but the primary metric is flat?
Treat the secondary movement as a hypothesis, not a result. Design a follow-up test where that secondary metric is the primary, with an appropriate sample size. Shipping based on a secondary win is one of the most common ways teams accumulate false positives that don't replicate in production.
What if the variant wins on one device but not the other?
This is genuinely common and usually meaningful: mobile and desktop are different products with different friction points. Re-run the experiment scoped to the device where it appears to win, with that device's traffic powering the sample-size calculation. Don't ship a global change based on a device-segment finding from a pooled test.
How can I tell whether a big win is too good to be true?
Three quick checks: is the lift more than 2x the MDE you powered for, did the effect appear in the first 48 hours and persist, and is there a plausible mechanism? Outsized early wins on small samples almost always shrink with more data. If you can't explain why users would behave this way, replicate before you ship.
Can I trust the confidence number my testing tool reports?
It's directionally useful, but interpretation depends on the engine. Bayesian "probability to be best" is honest about uncertainty across the full range of plausible effects; frequentist "confidence" is a long-run frequency statement that gets misread as a probability. Know which one your tool reports and read it accordingly.
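For readers whose tool reports a Bayesian figure, the sketch below shows one simple way a "probability to be best" can be estimated: sample each arm's conversion rate from a Beta posterior and count how often the variant comes out ahead. The uniform Beta(1, 1) prior and the counts are assumptions for illustration. Commercial engines may use different priors and models, so this will not necessarily reproduce any particular dashboard.

```python
import random


def probability_to_be_best(conv_c, n_c, conv_v, n_v, draws=20_000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate) under
    independent Beta posteriors with a uniform Beta(1, 1) prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(1 + conversions, 1 + non-conversions).
        rate_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        rate_v = rng.betavariate(1 + conv_v, 1 + n_v - conv_v)
        if rate_v > rate_c:
            wins += 1
    return wins / draws


# Hypothetical counts: 400/10,000 in control versus 460/10,000 in the variant.
print(f"P(variant is best) ~ {probability_to_be_best(400, 10_000, 460, 10_000):.1%}")
```

Unlike a frequentist confidence figure, this number is a direct probability statement about the variant, but only under the prior and model that produced it.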
What's the difference between statistical and practical significance?
Statistical significance asks whether the effect is real. Practical significance asks whether it's big enough to matter. A 0.3% lift can be statistically significant on a million sessions and still not be worth the engineering, maintenance, or complexity cost of shipping. Always pair the p-value with a minimum lift threshold.
When should I replicate a test?
Replicate when the financial impact is large, when the change is hard to roll back, or when the result surprised you. A second test usually shows a smaller effect than the first, and that smaller number is the one you should base business cases on. The first win tells you the direction; the replication tells you the size.