Statistical Significance Calculator
Drop in visitor and conversion counts from your A/B test and get a p-value, confidence level, and lift interval — so you know whether the win is real before you ship it.
Statistical Significance Calculator
A tool that takes two A/B variants' visitor and conversion counts and returns a p-value, confidence level, and lift interval.
A statistical significance calculator compares the conversion rates of two test variants and tells you how likely it is that a difference at least as large as the one you observed would show up through random noise alone if the variants truly performed the same. Under the hood it runs a two-proportion z-test on the visitor and conversion counts you provide, then converts the resulting z-score into a p-value and a confidence interval around the lift.
The job of the calculator is binary at decision time: did the variant beat the control with enough certainty to roll out, or do you need more traffic? It's the standard post-test gate before you ship a change to 100% of buyers.
A/B test significance calculator
Inputs: control visitors, control conversions, variant visitors, variant conversions, and significance level (α). 0.05 = 5% false-positive rate (standard).
Example output: statistical significance p = 0.0284; z-score 2.192; relative lift 16.67%; conversion rates control 3.60% → variant 4.20%.
Enter visitor and conversion counts for each variant. The calculator runs a two-proportion z-test and returns the p-value, the variant's lift over control, and a 95% confidence interval on that lift. Default confidence threshold is 95% (α = 0.05), two-tailed.
Use this calculator at the end of a test — after you've hit your pre-registered sample size and run for at least one full business cycle. Checking significance mid-test, then stopping early when you see a green number, is the single most common way to ship a fake win.
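The pre-registered sample size mentioned above usually comes from a power calculation. The sketch below is one common approximation for a two-tailed two-proportion test, not necessarily what any particular sample size tool uses. The function name `sample_size_per_arm`, the 80% power default, and the 3% baseline with a 15% target lift are all assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-tailed two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate the variant would need to reach
    p_bar = (p1 + p2) / 2                      # average of the two rates
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical plan: 3% baseline conversion rate, hoping to detect a 15% relative lift.
n = sample_size_per_arm(baseline_rate=0.03, relative_lift=0.15)
print(f"Visitors needed per arm: {n:,}")
```

Under these assumptions the answer lands at roughly 24,000 visitors per arm, which is why small lifts on low-traffic pages take so long to confirm.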
The math behind the calculator
z = (p_b - p_a) / sqrt( p_pool * (1 - p_pool) * (1/n_a + 1/n_b) )
p_a: Control conversion rate. Conversions in control divided by visitors in control.
p_b: Variant conversion rate. Conversions in variant divided by visitors in variant.
n_a: Control visitors. Total unique visitors assigned to the control.
n_b: Variant visitors. Total unique visitors assigned to the variant.
p_pool: Pooled conversion rate. (conversions_a + conversions_b) / (n_a + n_b), the shared baseline used under the null hypothesis.
z: Z-score. Standardised difference between the two rates; converts to a p-value via the normal distribution.
A Shopify apparel store tests a new product-page hero against control. Control: 12,000 visitors, 360 add-to-carts (3.00%). Variant: 12,000 visitors, 420 add-to-carts (3.50%).
Control visitors (n_a): 12000
Control conversions: 360
Variant visitors (n_b): 12000
Variant conversions: 420
Pooled rate (p_pool): 0.0325
→ z ≈ 2.18, two-tailed p ≈ 0.029
p ≈ 0.029 is below the 0.05 threshold, so you'd reject the null at 95% confidence. The variant's 16.7% relative lift is unlikely to be noise: ship it, and continue monitoring revenue per visitor in the 2 weeks after rollout.
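If you want to check the arithmetic yourself, the pooled z-test formula fits in a few lines of standard-library Python. This is a minimal sketch of the formula as written on this page, not the calculator's actual source; the function name `two_proportion_z_test` is made up for illustration.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-tailed p-value) for a pooled two-proportion z-test."""
    p_a = conv_a / n_a                        # control conversion rate
    p_b = conv_b / n_b                        # variant conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # shared rate under the null hypothesis
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))          # two-tailed area under the standard normal
    return z, p_value


# Counts from the apparel store example: 12,000 visitors per arm, 360 vs 420 add-to-carts.
z, p = two_proportion_z_test(conv_a=360, n_a=12_000, conv_b=420, n_b=12_000)
print(f"z = {z:.2f}, two-tailed p = {p:.3f}")
```

`erfc(|z| / sqrt(2))` gives the two-tailed p-value directly, so no statistics package is needed.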
The z-test assumes independent visitors, a binary outcome per visitor (converted / didn't), and large enough samples that the normal approximation holds — generally at least ~30 conversions per arm and conversion rates not pinned near 0% or 100%. For sparse events (refunds, high-AOV checkout completions on low traffic), Fisher's exact test is more honest.
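For the sparse-event case, a two-sided Fisher's exact test can be sketched with nothing more than `math.comb`. It assumes the usual definition of the two-sided p-value (sum the probability of every table that is no more likely than the observed one); the function name and the refund counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher's exact p-value for a 2x2 table of conversions by arm."""
    total_conv = conv_a + conv_b
    denom = comb(n_a + n_b, total_conv)

    def prob(k):
        # Hypergeometric probability that arm A holds exactly k of the conversions.
        return comb(n_a, k) * comb(n_b, total_conv - k) / denom

    observed = prob(conv_a)
    k_min = max(0, total_conv - n_b)
    k_max = min(n_a, total_conv)
    # Sum every possible table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(k_min, k_max + 1) if prob(k) <= observed * (1 + 1e-9)
    )


# Hypothetical sparse event: 4 refunds out of 500 orders vs 12 out of 500.
print(f"Fisher exact p = {fisher_exact_two_sided(4, 500, 12, 500):.4f}")
```

Because it enumerates exact probabilities rather than leaning on the normal approximation, this stays valid when an arm has only a handful of events.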
What a real test result looks like
Sample A/B test scenarios on an apparel store running PDP variants — what the calculator returns
| Scenario | Control CR | Variant CR | Visitors per arm | Relative lift | p-value | Call |
|---|---|---|---|---|---|---|
| Clear winner | 3.00% | 3.60% | 15,000 | +20.0% | 0.004 | Ship |
| Borderline | 3.00% | 3.30% | 15,000 | +10.0% | 0.137 | Extend test |
| Underpowered | 3.00% | 3.45% | 4,000 | +15.0% | 0.255 | Inconclusive |
| Flat | 3.00% | 3.05% | 20,000 | +1.7% | 0.770 | No effect |
| Negative | 3.00% | 2.70% | 15,000 | -10.0% | 0.118 | Kill variant |
| Large win, small N | 3.00% | 5.00% | 1,200 | +66.7% | 0.012 | Replicate before shipping |
Two lessons follow from the table. First, big relative lifts on small samples (the last row) are often regression-to-the-mean traps, so replicate or extend before rolling out. Second, a result that misses the 0.05 threshold isn't "almost significant"; it means you need more data, not a softer threshold.
Common ways teams misread the output
A p-value answers one specific question: if the variants were truly identical, how often would you see a difference this large by chance? It does not tell you the probability the variant is better, the size of the effect, or whether the change will hold up at full traffic. For those, look at the confidence interval on the lift — if it spans zero, you don't have a reliable directional read yet.
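To make the "does the interval span zero" check concrete, here is one way to put a confidence interval on relative lift. The calculator's exact interval method isn't documented here, so this sketch assumes a log-ratio (Katz) interval, a common choice for ratios of two rates. The function name `relative_lift_ci` is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_lift_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (lift, lower, upper) for the variant's relative lift over control."""
    ratio = (conv_b / n_b) / (conv_a / n_a)
    # Standard error of log(ratio) from the delta method (Katz log interval).
    se_log = sqrt(1 / conv_a - 1 / n_a + 1 / conv_b - 1 / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = exp(log(ratio) - z_crit * se_log) - 1
    upper = exp(log(ratio) + z_crit * se_log) - 1
    return ratio - 1, lower, upper


# Counts from the apparel store example: 360 vs 420 conversions on 12,000 visitors each.
lift, lower, upper = relative_lift_ci(360, 12_000, 420, 12_000)
print(f"lift = {lift:+.1%}, 95% CI = [{lower:+.1%}, {upper:+.1%}]")
print("Interval spans zero" if lower <= 0 <= upper else "Interval excludes zero")
```

Working on the log scale keeps the interval asymmetric around the point estimate, which suits a ratio better than a plain plus-or-minus band. Near the significance boundary it can disagree slightly with the pooled z-test.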
Don't peek, don't stop early
Checking significance every morning and ending the test the first time p drops below 0.05 inflates your false-positive rate dramatically — by some estimates from 5% to 25%+. Decide your sample size upfront, run until you hit it, then evaluate once. If you need sequential monitoring, switch to a Bayesian or always-valid test framework, not the classical z-test in this calculator.
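You can see the inflation for yourself with a small A/A simulation: both arms share the same true conversion rate, so every "significant" result is a false positive. This is an illustrative sketch only. The number of looks, traffic per look, the 5% base rate, and the seed are arbitrary assumptions, and the exact rates will vary with them.

```python
import random
from math import erfc, sqrt


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in outcomes yet, so nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return erfc(abs(z) / sqrt(2))


def simulate_peeking(n_tests=300, looks=10, visitors_per_look=400,
                     rate=0.05, alpha=0.05, seed=1):
    """Run A/A tests; compare 'stop at the first p < alpha' with one final look."""
    rng = random.Random(seed)
    stopped_early = significant_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        peeked_win = False
        for _ in range(looks):
            # Both arms draw from the same true rate, so there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p = p_value(conv_a, n, conv_b, n)
            peeked_win = peeked_win or p < alpha
        stopped_early += peeked_win        # would have shipped at some peek
        significant_at_end += p < alpha    # significant at the planned end only
    return stopped_early / n_tests, significant_at_end / n_tests


peek_rate, final_rate = simulate_peeking()
print(f"False positives with daily peeking: {peek_rate:.1%}")
print(f"False positives with one final look: {final_rate:.1%}")
```

The single-look rate should hover near α, while the peeking rate climbs as you add looks.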
Statistical significance calculator FAQ
95% (α = 0.05) is the standard default and what this calculator uses out of the box. Drop to 90% only for low-risk, easily reversible changes like copy tweaks where shipping a near-miss is cheap. Bump to 99% for high-risk changes — checkout flow, pricing pages — where a false positive costs real revenue.
A sample size calculator runs before the test to tell you how much traffic you need to detect a given lift. A significance calculator runs after the test to evaluate the data you actually collected. You should use both — sample size to plan, significance to decide.
Two-tailed is the safe default and what most CRO teams use, because it accounts for the variant being either better or worse than control. Use one-tailed only when a worse-than-control outcome would lead to exactly the same decision as a flat outcome — rare in practice, since you usually want to know if your variant tanked.
Not on the strength of this test alone. p = 0.07 means the data is suggestive but doesn't clear the 95% threshold. Your options are: extend the test to gather more traffic, lower your confidence threshold to 90% if the change is genuinely low-risk and pre-agreed, or treat the result as a hypothesis to retest with a sharper variant.
No — this calculator runs a two-proportion z-test for binary outcomes (converted yes/no). For continuous metrics like revenue per visitor or AOV you need a t-test or a bootstrap, because the distribution is skewed and a few high-value orders distort the variance. Test conversion rate here, then sanity-check revenue separately.
At least 30 conversions per arm for the normal approximation to behave, and ideally 200-400+ before you trust the result for a business decision. Below 30, switch to Fisher's exact test. The exact threshold depends on your minimum detectable effect — smaller lifts need much larger samples.
It's the range of relative lifts consistent with your data at the chosen confidence level. A 95% CI of [+4%, +28%] means you can be 95% confident the true lift is somewhere in that range. If the interval includes zero (e.g. [-2%, +15%]), the result isn't statistically significant regardless of the point estimate.
Not directly — comparing 3+ variants pairwise with this calculator inflates your false-positive rate through multiple comparisons. Either apply a Bonferroni correction (divide your α by the number of comparisons) or use an ANOVA-style multi-variant test. For most cases, run sequential two-variant tests instead.
Bayesian results are easier to communicate ("94% probability variant B wins") and don't suffer from peeking the same way. They're a reasonable choice, especially for stakeholders who find p-values unintuitive. The underlying decision is similar in most realistic scenarios — pick one framework and stick with it across your program rather than switching when results disagree.
About 5% of the time, an A/A test will return p < 0.05 by pure chance — that's what the 95% confidence level means. If you see it more often, check for instrumentation bugs: uneven traffic split, sample ratio mismatch, bot traffic in one arm, or events firing twice. A well-run A/A test is the fastest way to expose a broken tracking setup.
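One of the instrumentation checks above, sample ratio mismatch, is easy to automate. The sketch below assumes an intended 50/50 split and uses a chi-square goodness-of-fit test with one degree of freedom; the function name and the visitor counts are hypothetical, and what counts as "too small" a p-value is a judgment call your team should agree on in advance.

```python
from math import erfc, sqrt


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the traffic split."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical split: the test was configured 50/50 but the arms drifted apart.
p = srm_p_value(n_a=10_000, n_b=10_400)
print(f"SRM check p-value: {p:.4f}")
```

A very small p-value here means the split itself is off, so fix the assignment or tracking before reading anything into the conversion results.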