Confidence Intervals

Q: What does a 95% confidence interval actually mean?

It means that if you repeated the experiment many times under identical conditions, about 95% of the intervals you'd compute would contain the true effect. It does NOT mean there's a 95% probability the true effect is inside this specific interval — that's a common misreading, though for practical CRO purposes the two interpretations rarely lead to different decisions.

Q: Why are confidence intervals better than p-values?

A p-value is usually read against a threshold, which reduces the result to a binary answer ('significant or not') and hides magnitude. A confidence interval shows both the size of the effect and the uncertainty around it, so you can decide whether the lift is large enough to justify shipping, not just whether it's non-zero.

Q: How do I narrow a confidence interval?

Run more traffic. CI width shrinks in proportion to 1/√n, where n is the sample size, so quadrupling your sample halves the interval. You can also reduce variance by filtering out noise (e.g. excluding bot traffic) or by using a CUPED-style variance reduction technique.

Q: What is the difference between a confidence interval and a credible interval?

A confidence interval is frequentist: it's a property of the procedure, not the effect. A credible interval is Bayesian: it's a direct probability statement about the effect given the data and a prior. Bayesian A/B testing tools (and many modern CRO platforms) report credible intervals; classical stats tools report confidence intervals.
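To make the contrast concrete, here is a minimal sketch of a credible interval for the difference in conversion rates. It assumes a flat Beta(1, 1) prior on each arm and approximates the posterior with Monte Carlo draws; the function name `credible_interval_diff` and the example counts are illustrative, not taken from any particular platform.

```python
import random

def credible_interval_diff(conv_a, n_a, conv_b, n_b, level=0.95, draws=20_000, seed=0):
    """Equal-tailed credible interval for (rate_b - rate_a).

    Assumes independent Beta(1, 1) priors on each conversion rate,
    so each posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    diffs = sorted(
        rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        - rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    lower = diffs[int(tail * draws)]
    upper = diffs[int((1 - tail) * draws) - 1]
    return lower, upper

# Hypothetical counts: 400 of 10,000 sessions convert in control, 460 of 10,000 in variant.
lower, upper = credible_interval_diff(400, 10_000, 460, 10_000)
print(f"95% credible interval: [{lower:.4f}, {upper:.4f}]")
```

With a flat prior and samples this large, the credible interval tends to land numerically close to the frequentist one; what changes is the interpretation, not the arithmetic.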

Q: Can a confidence interval include zero and still be useful?

Yes. A CI of [−1%, +6%] tells you the variant is probably neutral-to-positive and almost certainly not a big loser. That's useful evidence to combine with qualitative data or to decide whether to keep iterating on a hypothesis rather than killing it outright.

Q: How does sample size affect the confidence interval?

Larger samples produce tighter intervals because the standard error shrinks. As a rule of thumb, doubling your sample size narrows the CI by about 30%, and quadrupling it halves the CI. This is why under-powered tests produce wide, ambiguous intervals even when the point estimate looks impressive.
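A quick way to check the rule of thumb is to compute the interval half-width at a few sample sizes. The sketch below assumes a normal approximation, equal traffic per variant, and a shared baseline rate of 4%; `ci_half_width` is a hypothetical helper name.

```python
import math

def ci_half_width(rate, n_per_variant, z=1.96):
    """Half-width of a normal-approximation CI for the difference of two
    conversion rates, assuming both arms share `rate` and the same sample size."""
    se = math.sqrt(2 * rate * (1 - rate) / n_per_variant)
    return z * se

base = ci_half_width(0.04, 10_000)
for factor in (1, 2, 4):
    width = ci_half_width(0.04, 10_000 * factor)
    print(f"{factor}x sample: half-width = {width * 100:.3f} pp "
          f"({width / base:.0%} of original)")
```

Doubling the sample leaves about 71% of the original width (a 29% reduction), and quadrupling leaves 50%.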

Q: Should I use a 90%, 95%, or 99% confidence interval?

95% is the industry default and a reasonable balance. Use 90% only if you're running many low-risk exploratory tests and can tolerate more false positives. Use 99% for high-stakes changes (pricing, checkout flow) where shipping a wrong variant is expensive. The higher the confidence level, the wider the interval — there's no free lunch.

Q: Why is my confidence interval asymmetric?

Intervals computed on ratios or relative lifts (rather than absolute differences) are often asymmetric — the upper bound can be further from the point estimate than the lower bound. This is normal and reflects the underlying distribution; don't assume a reporting bug.
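One common way this asymmetry arises is when the interval is built on the log of the ratio and then transformed back. The sketch below assumes that approach (a delta-method standard error on log(rate_b / rate_a)); your platform may use a different method, and `relative_lift_ci` is a hypothetical name.

```python
import math

def relative_lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Approximate CI for relative lift (rate_b / rate_a - 1).

    Builds a symmetric interval on log(rate_b / rate_a) using the delta-method
    standard error, then transforms back, which makes it asymmetric.
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    log_ratio = math.log(rate_b / rate_a)
    se_log = math.sqrt((1 - rate_a) / conv_a + (1 - rate_b) / conv_b)
    lower = math.exp(log_ratio - z * se_log) - 1
    upper = math.exp(log_ratio + z * se_log) - 1
    return rate_b / rate_a - 1, lower, upper

# Hypothetical counts: 400/10,000 in control, 460/10,000 in variant.
lift, lower, upper = relative_lift_ci(400, 10_000, 460, 10_000)
print(f"lift = {lift:+.1%}, 95% CI = [{lower:+.1%}, {upper:+.1%}]")
print(f"distance below: {lift - lower:.3f}, distance above: {upper - lift:.3f}")
```

The interval is symmetric on the log scale, so after exponentiating the upper bound sits further from the point estimate than the lower bound.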

Q: How are confidence intervals related to statistical significance?

They are two views of the same statistical analysis. If a 95% CI excludes zero, the result is statistically significant at p < 0.05. The CI adds magnitude and uncertainty information that the p-value alone hides — which is why most modern experimentation platforms report both.
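The link is easy to see when both numbers are computed from the same effect and standard error. This is a minimal sketch assuming a two-sided z-test under the normal approximation; the effect and SE values are made up for illustration.

```python
import math

def z_test_and_ci(effect, se, z_crit=1.96):
    """Two-sided p-value and CI built from the same effect and standard error."""
    z_stat = effect / se
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))  # two-sided normal tail area
    ci = (effect - z_crit * se, effect + z_crit * se)
    return p_value, ci

# Hypothetical inputs: same SE, two different observed effects.
for effect, se in [(0.006, 0.0029), (0.004, 0.0029)]:
    p_value, (lower, upper) = z_test_and_ci(effect, se)
    excludes_zero = lower > 0 or upper < 0
    print(f"effect={effect}: p={p_value:.3f}, CI=[{lower:.4f}, {upper:.4f}], "
          f"excludes zero: {excludes_zero}")
```

The equivalence is exact only when the test and the interval share a standard error. Tools that use a pooled SE for the test and an unpooled SE for the interval can disagree right at the boundary.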

Q: Can I trust a confidence interval if I peeked at the test early?

No. Repeatedly checking a test and stopping when the CI excludes zero inflates your false-positive rate well above 5%. Either pre-commit to a sample size, or use a sequential testing method (e.g. mSPRT, group-sequential designs) that adjusts the interval for the peeking.
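You can see the inflation in a small simulation of A/A tests, where there is no true effect and every "significant" result is a false positive. The sketch below is a simplified model, not a full traffic simulation: it assumes ten equally spaced looks and treats the z-statistic at each look as a scaled running sum of standard normal draws.

```python
import random

def false_positive_rates(n_sims=2_000, looks=10, z_crit=1.96, seed=1):
    """Simulate A/A tests (no true effect) and compare stopping rules.

    Each look adds one equal-sized batch of data. Under the null, the z-statistic
    after k batches behaves like (sum of k standard normal draws) / sqrt(k).
    """
    rng = random.Random(seed)
    hits_peeking = hits_fixed = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed_any = False
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0, 1)
            z_stat = running_sum / k ** 0.5
            if abs(z_stat) > z_crit:
                crossed_any = True  # a peeker would have stopped at this look
        hits_peeking += crossed_any
        hits_fixed += abs(z_stat) > z_crit  # only the final look counts
    return hits_peeking / n_sims, hits_fixed / n_sims

peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at first significant look: {peeking_rate:.1%} false positives")
print(f"single look at planned sample:  {fixed_rate:.1%} false positives")
```

The single-look rate should hover near the nominal 5%, while the stop-when-significant rate comes out several times higher.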

Metricuno

May 19, 2026

Quick answer

A confidence interval is the range your true experiment effect likely falls within — a far more honest summary of an A/B test result than a single p-value.

Definition

A range that likely contains the true effect of a test, with a stated confidence level — typically 95%.

A confidence interval (CI) is a range of values, calculated from your experiment data, that is likely to contain the true underlying effect — for example, the real conversion-rate lift of a checkout variant over the control. The most common version is the 95% CI, meaning that if you repeated the test many times, about 95% of the intervals you'd compute would capture the true effect.

Unlike a p-value, which collapses the result into a binary 'significant or not', a confidence interval communicates two things at once: the magnitude of the effect and how much uncertainty remains around it. That makes it the more honest way to report an A/B test outcome to a stakeholder.

Also known as

95% CI

interval estimate

Most CRO teams stop reading a test the moment p < 0.05 lights up green. That's a mistake. A p-value tells you how surprising a lift at least this large would be if the variant truly made no difference; it does not tell you how big the lift actually is, or how wide the plausible range around it might be.

A confidence interval fixes that. A result of '+8% lift, 95% CI [+1%, +15%]' is genuinely useful — you know the true effect is probably positive, but it could realistically be anywhere from 'barely worth shipping' to 'huge win'. Compare that to '+8%, p = 0.03', which hides the same uncertainty behind a green tick.

Formula

CI = effect ± z * SE

Variables

effect

Observed effect

The measured difference between variant and control (absolute or relative lift).

z

Critical value

For 95% confidence, z ≈ 1.96. For 90% use 1.645, for 99% use 2.576.

SE

Standard error

Standard deviation of the sampling distribution of the effect; it depends on conversion rates and sample size per variant.
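The critical values come straight from the standard normal distribution, so you don't need to memorise them. Here is a minimal sketch using Python's standard library; `critical_value` and `confidence_interval` are illustrative helper names, and the normal approximation is assumed throughout.

```python
from statistics import NormalDist

def critical_value(confidence):
    """Two-sided z critical value for a given confidence level (e.g. 0.95)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def confidence_interval(effect, se, confidence=0.95):
    """CI = effect ± z * SE, using the normal approximation."""
    z = critical_value(confidence)
    return effect - z * se, effect + z * se

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {critical_value(level):.3f}")
# 90%: z = 1.645
# 95%: z = 1.960
# 99%: z = 2.576
```

Passing a different `confidence` to `confidence_interval` widens or narrows the interval around the same effect and SE.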

Worked example

A Shopify apparel store tests a new checkout layout. Control converts 4.0% over 10,000 sessions; variant converts 4.6% over 10,000 sessions.

Observed absolute lift: 0.6 percentage points

Standard error (pooled): ≈ 0.29 pp

z (95% confidence): 1.96

→ 95% CI: [0.03 pp, 1.17 pp] — equivalent to a relative lift between roughly +0.8% and +29%.

The interval excludes zero, so the result is statistically significant — but the lower bound is barely positive. The variant probably wins, but the magnitude is highly uncertain. Either run longer for a tighter interval, or ship with eyes open.
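The arithmetic in this example can be reproduced in a few lines. The sketch below assumes the pooled standard error named above and a normal approximation; `pooled_diff_ci` is a hypothetical helper, and the conversion counts (400 and 460) are derived from the stated rates and session totals.

```python
import math

def pooled_diff_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Normal-approximation CI for (rate_b - rate_a) using a pooled standard error."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    diff = rate_b - rate_a
    return diff, se, (diff - z * se, diff + z * se)

# Control: 4.0% of 10,000 sessions; variant: 4.6% of 10,000 sessions.
diff, se, (lower, upper) = pooled_diff_ci(400, 10_000, 460, 10_000)
print(f"lift = {diff * 100:.2f} pp, SE = {se * 100:.2f} pp")
print(f"95% CI = [{lower * 100:.2f} pp, {upper * 100:.2f} pp]")
print(f"relative to control: {lower / 0.04:+.1%} to {upper / 0.04:+.1%}")
```

Carrying full precision gives bounds of roughly 0.04 pp and 1.16 pp, a hair different from the figures above, which appear to round the SE to 0.29 pp before multiplying. The conclusion is the same either way.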

Two practical rules of thumb. First: if the CI excludes zero, the result is statistically significant at that confidence level. Second: the width of the interval tells you how much you actually know — a tight CI around a meaningful effect is a confident win; a wide CI that barely excludes zero is a coin flip wearing a tuxedo.

Benchmark

Typical 95% CI for a checkout test with an observed +10% relative lift, by sample size per variant (baseline conversion 3%)

Sessions per variant	95% CI (relative lift)	CI half-width (±, points of relative lift)	Verdict
5,000	−25% to +45%	±35 pp	Useless — interval contains both losers and big winners
20,000	−8% to +28%	±18 pp	Still too wide — straddles zero
50,000	+1% to +19%	±9 pp	Just significant — direction known, magnitude vague
100,000	+4% to +16%	±6 pp	Confident win — tight enough to forecast
250,000	+7% to +13%	±3 pp	Best in class — ship and model the revenue impact

When you read a CI in a test report, look at three things in order: does it exclude zero, is the lower bound large enough to matter commercially, and is the upper bound realistic given your baseline. A CI of [+0.1%, +40%] on a single payment-button colour change is a sign you stopped the test too early, not a sign of a 40% lift.
