Confidence Intervals
A confidence interval is the range your true experiment effect likely falls within — a far more honest summary of an A/B test result than a single p-value.
A range that likely contains the true effect of a test, with a stated confidence level — typically 95%.
A confidence interval (CI) is a range of values, calculated from your experiment data, that is likely to contain the true underlying effect — for example, the real conversion-rate lift of a checkout variant over the control. The most common version is the 95% CI, meaning that if you repeated the test many times, about 95% of the intervals you'd compute would capture the true effect.
Unlike a p-value, which is usually reduced to a binary 'significant or not' verdict, a confidence interval communicates two things at once: the magnitude of the effect and how much uncertainty remains around it. That makes it the more honest way to report an A/B test outcome to a stakeholder.
Most CRO teams stop reading a test the moment p < 0.05 lights up green. That's a mistake. A p-value tells you how surprising the observed lift would be if there were no true difference (the null hypothesis); it does not tell you how big the lift actually is, or how wide the plausible range around it might be.
A confidence interval fixes that. A result of '+8% lift, 95% CI [+1%, +15%]' is genuinely useful — you know the true effect is probably positive, but it could realistically be anywhere from 'barely worth shipping' to 'huge win'. Compare that to '+8%, p = 0.03', which hides the same uncertainty behind a green tick.
CI = effect ± z * SE
effect (observed effect): The measured difference between variant and control (absolute or relative lift).
z (critical value): For 95% confidence, z ≈ 1.96. For 90% use 1.645, for 99% use 2.576.
SE (standard error): Standard deviation of the sampling distribution of the effect; it depends on the conversion rates and the sample size per variant.
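The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names `critical_value` and `confidence_interval` are hypothetical, and it assumes the effect estimate is approximately normally distributed.

```python
from statistics import NormalDist


def critical_value(confidence: float) -> float:
    """Two-sided critical value z for a confidence level such as 0.95."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def confidence_interval(effect: float, se: float, confidence: float = 0.95):
    """Return (lower, upper) for CI = effect ± z * SE."""
    z = critical_value(confidence)
    return effect - z * se, effect + z * se


# Reproduce the critical values quoted in the definitions above.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {critical_value(level):.3f}")
```

Raising the confidence level raises z, so the same effect and standard error produce a wider interval.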
A Shopify apparel store tests a new checkout layout. Control converts 4.0% over 10,000 sessions; variant converts 4.6% over 10,000 sessions.
Observed absolute lift: 0.6 percentage points
Standard error (pooled): ≈ 0.29 pp
z (95% confidence): 1.96
→ 95% CI: [0.03 pp, 1.17 pp] — equivalent to a relative lift between roughly +0.8% and +29%.
The interval excludes zero, so the result is statistically significant — but the lower bound is barely positive. The variant probably wins, but the magnitude is highly uncertain. Either run longer for a tighter interval, or ship with eyes open.
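The arithmetic in this example can be reproduced with a short sketch. It assumes 400 and 460 conversions (4.0% and 4.6% of 10,000 sessions) and uses the unpooled standard error, which here agrees with the pooled figure to two decimals. The helper name `lift_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def lift_ci(conv_c, n_c, conv_v, n_v, confidence=0.95):
    """Normal-approximation CI for the absolute difference in conversion rate."""
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    diff = p_v - p_c
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, se, (diff - z * se, diff + z * se)


diff, se, (lo, hi) = lift_ci(400, 10_000, 460, 10_000)
print(f"lift = {diff * 100:.2f} pp, SE = {se * 100:.2f} pp")
print(f"95% CI = [{lo * 100:.2f} pp, {hi * 100:.2f} pp]")
```

Because the code carries the unrounded standard error through, its bounds come out a hair tighter (about 0.04 pp to 1.16 pp) than the figures above, which use the rounded 0.29 pp. The conclusion is unchanged.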
Two practical rules of thumb. First: if the CI excludes zero, the result is statistically significant at that confidence level. Second: the width of the interval tells you how much you actually know — a tight CI around a meaningful effect is a confident win; a wide CI that barely excludes zero is a coin flip wearing a tuxedo.
Illustrative 95% CI for a checkout test with an observed 10% relative lift, by sessions per variant (baseline conversion 3%)
| Sessions per variant | 95% CI for relative lift | CI half-width (±, points of relative lift) | Verdict |
|---|---|---|---|
| 5,000 | −25% to +45% | ±35 | Useless: interval contains both losers and big winners |
| 20,000 | −8% to +28% | ±18 | Still too wide: straddles zero |
| 50,000 | +1% to +19% | ±9 | Just significant: direction known, magnitude vague |
| 100,000 | +4% to +16% | ±6 | Confident win: tight enough to forecast |
| 250,000 | +7% to +13% | ±3 | Best in class: ship and model the revenue impact |
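To estimate the half-width for your own traffic, a rough normal-approximation sketch is shown below. It assumes a 3% baseline, a 10% observed relative lift, and treats the baseline as a fixed divisor when converting to relative terms, which is a simplification. The name `relative_half_width` is hypothetical. The table's figures are illustrative; this calculation gives somewhat narrower intervals at the smaller sample sizes, so treat both as order-of-magnitude guides.

```python
from math import sqrt

BASELINE = 0.03        # assumed control conversion rate
RELATIVE_LIFT = 0.10   # assumed observed relative lift
Z_95 = 1.96            # critical value for 95% confidence


def relative_half_width(n_per_variant: int) -> float:
    """Approximate 95% CI half-width, expressed as a fraction of the baseline."""
    p_c = BASELINE
    p_v = BASELINE * (1 + RELATIVE_LIFT)
    se = sqrt(p_c * (1 - p_c) / n_per_variant + p_v * (1 - p_v) / n_per_variant)
    return Z_95 * se / p_c


for n in (5_000, 20_000, 50_000, 100_000, 250_000):
    half = relative_half_width(n) * 100
    print(f"{n:>7,} sessions per variant: ±{half:.1f} points of relative lift")
```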
When you read a CI in a test report, look at three things in order: does it exclude zero, is the lower bound large enough to matter commercially, and is the upper bound realistic given your baseline. A CI of [+0.1%, +40%] on a single payment-button colour change is a sign you stopped the test too early, not a sign of a 40% lift.
Confidence intervals: frequently asked questions
A 95% confidence level means that if you repeated the experiment many times under identical conditions, about 95% of the intervals you'd compute would contain the true effect. It does NOT mean there's a 95% probability the true effect is inside this specific interval. That's a common misreading, though for practical CRO purposes the two interpretations rarely lead to different decisions.
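The repeated-experiment reading can be checked with a small simulation. The sketch below assumes true conversion rates of 10% and 12% and 500 sessions per variant. These numbers are chosen only to keep the run fast, and `simulate_coverage` is a hypothetical name. It counts how often the 95% interval contains the true difference.

```python
import random
from math import sqrt


def simulate_coverage(p_c=0.10, p_v=0.12, n=500, runs=1500, seed=7):
    """Fraction of simulated experiments whose 95% CI contains the true difference."""
    rng = random.Random(seed)
    true_diff = p_v - p_c
    hits = 0
    for _ in range(runs):
        # Simulate one experiment: n Bernoulli sessions per arm.
        conv_c = sum(rng.random() < p_c for _ in range(n))
        conv_v = sum(rng.random() < p_v for _ in range(n))
        rate_c, rate_v = conv_c / n, conv_v / n
        diff = rate_v - rate_c
        se = sqrt(rate_c * (1 - rate_c) / n + rate_v * (1 - rate_v) / n)
        if diff - 1.96 * se <= true_diff <= diff + 1.96 * se:
            hits += 1
    return hits / runs


coverage = simulate_coverage()
print(f"share of intervals containing the true effect: {coverage:.3f}")
```

The share should land close to 0.95. Any single interval either contains the true effect or it doesn't; the 95% describes the procedure across many runs.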
A p-value is usually boiled down to a single binary answer ('significant or not') and hides magnitude. A confidence interval shows both the size of the effect and the uncertainty around it, so you can decide whether the lift is large enough to justify shipping, not just whether it's non-zero.
To narrow a confidence interval, run more traffic. CI width shrinks roughly with the square root of sample size, so quadrupling your sample halves the interval. You can also reduce variance by filtering out noise (e.g. excluding bot traffic) or by using a CUPED-style variance reduction technique.
A confidence interval is frequentist: it's a property of the procedure, not the effect. A credible interval is Bayesian: it's a direct probability statement about the effect given the data and a prior. Bayesian A/B testing tools (and many modern CRO platforms) report credible intervals; classical stats tools report confidence intervals.
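For comparison, a Bayesian credible interval can be sketched with Monte Carlo draws from Beta posteriors. This assumes a flat Beta(1, 1) prior on each conversion rate, which is an illustrative choice and not necessarily what any particular platform uses. The function name `credible_interval` is hypothetical.

```python
import random


def credible_interval(conv_c, n_c, conv_v, n_v, level=0.95, draws=20_000, seed=11):
    """Equal-tailed credible interval for (variant rate - control rate)."""
    rng = random.Random(seed)
    # Posterior for each rate under a Beta(1, 1) prior: Beta(1 + conv, 1 + n - conv).
    diffs = sorted(
        rng.betavariate(1 + conv_v, 1 + n_v - conv_v)
        - rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    lower = diffs[int(tail * draws)]
    upper = diffs[int((1 - tail) * draws) - 1]
    return lower, upper


lower, upper = credible_interval(400, 10_000, 460, 10_000)
print(f"95% credible interval: [{lower * 100:.2f} pp, {upper * 100:.2f} pp]")
```

With a flat prior and this much data, the credible interval typically lands close to the frequentist one; the two diverge more with small samples or informative priors.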
A non-significant result can still be informative. A CI of [−1%, +6%] tells you the variant is probably neutral-to-positive and almost certainly not a big loser. That's useful evidence to combine with qualitative data or to decide whether to keep iterating on a hypothesis rather than killing it outright.
Larger samples produce tighter intervals because the standard error shrinks. As a rule of thumb, doubling your sample size narrows the CI by about 30%, and quadrupling it halves the CI. This is why under-powered tests produce wide, ambiguous intervals even when the point estimate looks impressive.
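The rule of thumb follows from the standard error scaling with one over the square root of sample size. A minimal sketch, with the hypothetical helper `width_ratio`:

```python
from math import sqrt


def width_ratio(sample_multiplier: float) -> float:
    """CI width after scaling the sample, as a fraction of the original width."""
    return 1 / sqrt(sample_multiplier)


for k in (2, 4, 10):
    print(f"{k}x the sample -> interval narrows by {1 - width_ratio(k):.0%}")
```

Doubling gives about 29% and quadrupling gives 50%, which is why the last few points of precision are so expensive in traffic.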
95% is the industry default and a reasonable balance. Use 90% only if you're running many low-risk exploratory tests and can tolerate more false positives. Use 99% for high-stakes changes (pricing, checkout flow) where shipping a wrong variant is expensive. The higher the confidence level, the wider the interval — there's no free lunch.
Intervals computed on ratios or relative lifts (rather than absolute differences) are often asymmetric — the upper bound can be further from the point estimate than the lower bound. This is normal and reflects the underlying distribution; don't assume a reporting bug.
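One common way such asymmetry arises is when the interval is built on the log of the conversion-rate ratio and then transformed back. The sketch below uses that log-ratio (Katz-style) method as an assumption for illustration; your platform may use a different method. The name `relative_lift_ci` is hypothetical.

```python
from math import exp, log, sqrt


def relative_lift_ci(conv_c, n_c, conv_v, n_v, z=1.96):
    """CI for relative lift, computed on the log of the conversion-rate ratio."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    log_ratio = log(p_v / p_c)
    # Standard error of log(p_v / p_c) for two independent proportions.
    se_log = sqrt((1 - p_c) / conv_c + (1 - p_v) / conv_v)
    lower = exp(log_ratio - z * se_log) - 1
    upper = exp(log_ratio + z * se_log) - 1
    return p_v / p_c - 1, lower, upper


point, lower, upper = relative_lift_ci(400, 10_000, 460, 10_000)
print(f"relative lift = {point:+.1%}, 95% CI [{lower:+.1%}, {upper:+.1%}]")
print(f"distance below: {point - lower:.3f}, distance above: {upper - point:.3f}")
```

The interval is symmetric on the log scale, so after exponentiating the upper bound sits further from the point estimate than the lower bound does.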
A confidence interval and a p-value are two views of the same statistical analysis. If a 95% CI excludes zero, the result is statistically significant at p < 0.05, provided both are computed with the same method. The CI adds magnitude and uncertainty information that the p-value alone hides, which is why most modern experimentation platforms report both.
Don't stop a test the moment the CI first excludes zero. Repeatedly checking a test and stopping when the CI excludes zero inflates your false-positive rate well above 5%. Either pre-commit to a sample size, or use a sequential testing method (e.g. mSPRT, group-sequential designs) that adjusts the interval for the peeking.
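The link between the interval and the p-value can be shown with a sketch that computes both from the same standard error. It assumes a two-sided z-test with the unpooled standard error; tools that use a pooled standard error for the test can disagree with the interval right at the margin. The name `wald_summary` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_summary(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-sided p-value and matching CI, both from the same standard error."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    se = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    normal = NormalDist()
    p_value = 2 * (1 - normal.cdf(abs(diff / se)))
    z_crit = normal.inv_cdf(1 - alpha / 2)
    lower, upper = diff - z_crit * se, diff + z_crit * se
    excludes_zero = lower > 0 or upper < 0
    return p_value, (lower, upper), excludes_zero


for conv_v in (460, 420):
    p_value, ci, excludes_zero = wald_summary(400, 10_000, conv_v, 10_000)
    print(f"variant conversions={conv_v}: p={p_value:.3f}, CI excludes zero={excludes_zero}")
```

In both cases the interval excludes zero exactly when the p-value falls below 0.05, because the two are derived from the same quantities.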
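The inflation from peeking can be seen in a small A/A simulation, where both arms share the same true rate, so every "significant" result is a false positive. The sketch assumes a 10% conversion rate and 2,000 sessions per arm to keep the run fast. The name `false_positive_rate` is hypothetical, and the stopping rule is the naive one described above, not a sequential method.

```python
import random
from math import sqrt


def false_positive_rate(looks, total_n=2_000, p=0.10, runs=400, seed=3):
    """Share of A/A tests declared significant at any of `looks` evenly spaced checks."""
    rng = random.Random(seed)
    batch = total_n // looks
    false_positives = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        for _ in range(looks):
            conv_a += sum(rng.random() < p for _ in range(batch))
            conv_b += sum(rng.random() < p for _ in range(batch))
            n += batch
            rate_a, rate_b = conv_a / n, conv_b / n
            se = sqrt(rate_a * (1 - rate_a) / n + rate_b * (1 - rate_b) / n)
            # Naive rule: stop as soon as the 95% CI excludes zero.
            if se > 0 and abs(rate_b - rate_a) > 1.96 * se:
                false_positives += 1
                break
    return false_positives / runs


single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"one look at the end: {single_look:.1%} false positives")
print(f"ten looks, stop at first 'win': {ten_looks:.1%} false positives")
```

With a single look the rate stays near the nominal 5%; with ten looks and early stopping it is typically several times higher.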