P-Values
A p-value is the probability of seeing your A/B test result (or something more extreme) if the variant actually did nothing. Here's how to read it correctly.
A p-value is a conditional probability produced by a statistical test. It answers a narrow question: assuming the null hypothesis is true — that your variant and control perform identically — how often would random sampling alone produce a difference at least as large as the one you saw?
Small p-values mean the observed gap is unlikely under that no-effect assumption, which is why teams treat them as evidence against the null. They do not tell you the probability that your variant wins, the size of the lift, or how confident you should be in shipping. Reading any of those into a p-value is the single biggest source of misreads in conversion rate optimization (CRO).
In an A/B test, the p-value is the output of a significance test (usually a two-sample z-test or t-test on conversion rate) run against the data you collected. A p of 0.03 means: if the variant truly did nothing, you would see a difference this large or larger in roughly 3 out of every 100 tests by chance alone.
The conventional cutoff in CRO is p < 0.05, inherited from frequentist statistical analysis. That threshold is a convention, not a law of physics — and it does not adjust for peeking, multiple variants, or the business cost of a wrong call. Treat it as one input into a shipping decision, not the decision itself.
p = 2 * (1 - Φ(|z|)) where z = (p_b - p_a) / sqrt( p_pool * (1 - p_pool) * (1/n_a + 1/n_b) )
| Symbol | Name | Definition |
|---|---|---|
| p_a | Control conversion rate | Observed conversion rate in the control group (A). |
| p_b | Variant conversion rate | Observed conversion rate in the variant group (B). |
| n_a | Control sample size | Number of visitors assigned to the control. |
| n_b | Variant sample size | Number of visitors assigned to the variant. |
| p_pool | Pooled conversion rate | Combined conversion rate across both groups: (conversions_a + conversions_b) / (n_a + n_b). |
| z | Z-score | Standardised distance between the two conversion rates. |
| Φ | Standard normal CDF | Cumulative distribution function of the standard normal. |
An apparel Shopify store tests a new product-page layout. Control: 12,000 visitors, 360 add-to-carts (3.00%). Variant: 12,000 visitors, 432 add-to-carts (3.60%).
p_a: 0.0300
p_b: 0.0360
n_a: 12000
n_b: 12000
p_pool: 0.0330
→ z ≈ 2.60, two-sided p ≈ 0.0093
Under the null hypothesis of no real difference, you'd see a gap this large or larger about 1% of the time by chance. That is below the 0.05 threshold, so most teams would call the variant a winner, assuming the test ran to its pre-planned sample size.
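For readers who want to check the arithmetic, here is a minimal Python sketch that applies the pooled z-test formula to the counts from the apparel store example. The function name `two_proportion_p_value` is made up for illustration, and the sketch uses only the standard library; a dedicated statistics package may apply slightly different conventions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p) for a pooled two-proportion z-test."""
    p_a = conv_a / n_a                          # control conversion rate
    p_b = conv_b / n_b                          # variant conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)    # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Counts from the product-page example: 360 vs 432 add-to-carts, 12,000 visitors each.
z, p = two_proportion_p_value(360, 12_000, 432, 12_000)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

The function takes raw conversion counts rather than rates so the pooled rate is computed from the same inputs and cannot drift out of sync with them.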
Two practical notes. First, the formula above is for binary conversion outcomes; revenue-per-visitor or AOV tests use a t-test on continuous data and produce a different p-value. Second, for any real lift, however tiny, the p-value shrinks as your sample grows, which is why effect size and confidence interval matter as much as the p-value itself.
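To illustrate the second note, the sketch below holds the observed rates fixed at a hypothetical 3.00% vs 3.03% (a 1% relative lift, chosen only for illustration) and recomputes the two-sided p-value at increasing sample sizes. It assumes equal arm sizes and that both arms land exactly on those rates, which real data would not do.

```python
from math import sqrt
from statistics import NormalDist


def p_value_for_fixed_rates(p_a, p_b, n_per_arm):
    """Two-sided p-value if both arms show exactly these rates at this size."""
    p_pool = (p_a + p_b) / 2   # equal arm sizes, so the pooled rate is the mean
    se = sqrt(p_pool * (1 - p_pool) * (2 / n_per_arm))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


sizes = [10_000, 100_000, 1_000_000, 10_000_000]
p_values = [p_value_for_fixed_rates(0.0300, 0.0303, n) for n in sizes]

for n, p in zip(sizes, p_values):
    print(f"{n:>10,} visitors per arm -> p = {p:.4f}")
```

The lift never changes, yet the p-value moves from nowhere near significance to far below 0.05. Only the sample size did that, which is the case for reporting the lift and its interval alongside the p-value.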
How CRO teams typically read p-values in A/B tests
| P-value range | Conventional label | Typical CRO action | Caveat |
|---|---|---|---|
| p < 0.01 | Highly significant | Ship the variant | Still check effect size and segment stability |
| 0.01 ≤ p < 0.05 | Significant | Ship if test reached planned sample size | Peeking inflates false positives — don't stop early |
| 0.05 ≤ p < 0.10 | Marginal / trending | Extend the test or iterate | Common in underpowered tests with <10k visitors per arm |
| 0.10 ≤ p < 0.20 | Inconclusive | No ship; revisit hypothesis | Often the variant is genuinely flat |
| p ≥ 0.20 | No evidence | Kill the variant | Treat as a learning, not a failure |
The most common misread: treating p = 0.03 as "there's a 97% chance the variant wins." That is not what it says. The p-value conditions on the null being true; it does not give you the probability that the null is false. For a direct "probability variant beats control" reading, you want a Bayesian A/B test, which outputs exactly that quantity.
P-values in A/B testing: FAQ
A result with p < 0.05 means that if the variant truly had no effect, you'd see a result this extreme or more extreme less than 5% of the time by random chance. It does not mean there is a 95% chance the variant is better; that is a different quantity (a Bayesian posterior).
A p-value is not the probability that the null hypothesis is true. It's the probability of data at least this extreme given the null hypothesis, not the probability of the hypothesis given the data. Conflating the two is the single most common error in CRO interpretation.
The 0.05 cutoff is a convention popularised by R.A. Fisher in the 1920s, not a mathematical truth. Some teams use stricter cutoffs (0.01) for irreversible shipping decisions and looser ones (0.10) for low-risk iterative changes.
Stopping a test the moment it crosses significance is called peeking, and it dramatically inflates your false-positive rate. If you peek daily without correction, a test designed for 5% false positives can deliver 20-30%. Decide your sample size up front and wait.
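The inflation from peeking can be seen in a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. It compares declaring a winner at any of 10 interim looks against a single look at the end. All parameters (500 tests, 10 looks, 200 visitors per arm per look, a 10% base rate) are arbitrary assumptions chosen to keep the run fast, not values from any real test.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(conv_a, conv_b, n):
    """Pooled two-proportion z-test p-value for two arms of equal size n."""
    p_pool = (conv_a + conv_b) / (2 * n)
    if p_pool in (0.0, 1.0):
        return 1.0  # no variation observed, so no evidence of a difference
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=500, looks=10, visitors_per_look=200,
                      rate=0.10, alpha=0.05, seed=7):
    """Return (false-positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        for _look in range(looks):
            # Both arms share the same true rate: any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p = two_sided_p(conv_a, conv_b, n)
            if p < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look  # would have stopped early
        final_hits += p < alpha                  # p from the last look only
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"stop at first p < 0.05: {peeking_rate:.1%} false positives")
print(f"single look at the end: {fixed_rate:.1%} false positives")
```

The final look is one of the ten, so the peeking rate can never be lower than the fixed-horizon rate. Every extra look is another chance for noise to cross the threshold.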
Statistical significance is the binary label you assign after comparing the p-value to a threshold (e.g. p < 0.05 = "significant"). The p-value is the underlying continuous number. Same test, coarser framing: the label drops how far the p-value sits from the threshold.
P-values and confidence intervals are two views of the same test. If the 95% confidence interval for the difference excludes zero, the two-sided p-value is below 0.05. Confidence intervals are usually more useful because they show effect size, not just whether something is non-zero.
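As a rough sketch of that second view, the code below computes a 95% Wald interval for the absolute difference in conversion rate, using the counts from the apparel store example. The function name is hypothetical. One assumption to flag: the interval uses the unpooled standard error, while the p-value formula above pools the two groups, so the two can disagree slightly in borderline cases.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald interval for (variant rate - control rate), unpooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


low, high = diff_confidence_interval(360, 12_000, 432, 12_000)
print(f"95% CI for the absolute lift: {low:.4%} to {high:.4%}")
```

The interval sits entirely above zero, which agrees with the significant p-value. It also shows that the plausible lift ranges from small to fairly large, and a p-value alone does not show that.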
A two-tailed test checks for a difference in either direction (the variant could be better or worse) and is the default in CRO. A one-tailed test checks in one direction only and halves the p-value when the result lands in that direction. Use it only when a worse outcome is truly impossible, which is rare.
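The relationship between the two is easy to see from a z-score. This is a minimal sketch with an illustrative z of ±1.80 (not taken from any test above). The one-tailed version assumes the hypothesis being tested is "variant is better", meaning a positive z.

```python
from statistics import NormalDist


def tail_p_values(z):
    """Return (two-tailed p, one-tailed p for the hypothesis 'variant is better')."""
    norm = NormalDist()
    two_tailed = 2 * (1 - norm.cdf(abs(z)))
    one_tailed = 1 - norm.cdf(z)  # only counts results above the observed z
    return two_tailed, one_tailed


for z in (1.80, -1.80):
    two, one = tail_p_values(z)
    print(f"z = {z:+.2f}: two-tailed p = {two:.4f}, one-tailed p = {one:.4f}")
```

For the positive z, the one-tailed p is exactly half the two-tailed p, which is enough to turn a non-significant result into a significant one at 0.05. For the negative z, the one-tailed test reports a p close to 1 and cannot flag a variant that is hurting conversions.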
With a true underlying lift, p-values shrink as sample size grows because you're accumulating evidence. With no real lift, p-values bounce around randomly. This is why pre-committed sample size matters.
Bayesian A/B testing gives you the intuitive answer most stakeholders actually want — "probability variant beats control" and "expected loss if we ship the wrong one." Frequentist p-values remain the industry default, but Bayesian is increasingly common in modern testing tools.
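For a sense of what those two outputs look like, here is a minimal Monte Carlo sketch on the counts from the apparel store example. It assumes independent uniform Beta(1, 1) priors on each conversion rate, which is a common default but an assumption here. The function name is hypothetical, and testing tools may use different priors or closed-form calculations.

```python
import random


def bayesian_ab_summary(conv_a, n_a, conv_b, n_b, draws=20_000, seed=42):
    """Monte Carlo estimate of P(B > A) and the expected loss of shipping B."""
    rng = random.Random(seed)
    wins_b = 0
    loss_if_ship_b = 0.0
    for _ in range(draws):
        # Beta(1 + conversions, 1 + non-conversions): posterior under a uniform prior.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins_b += rate_b > rate_a
        # Loss is the conversion rate given up in draws where A is actually better.
        loss_if_ship_b += max(rate_a - rate_b, 0.0)
    return wins_b / draws, loss_if_ship_b / draws


prob_b_better, expected_loss = bayesian_ab_summary(360, 12_000, 432, 12_000)
print(f"P(variant beats control) = {prob_b_better:.1%}")
print(f"expected loss if we ship the variant = {expected_loss:.6f}")
```

The first number is a direct probability statement about the variant. That is what a p-value is often mistaken for, but the two are computed differently and answer different questions.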
With multiple variants, use a threshold lower than 0.05 per comparison, because running multiple variants inflates the chance that one looks significant by accident. A Bonferroni correction (divide your threshold by the number of comparisons against control) is the simplest fix; for 4 variants vs control, that's p < 0.0125.
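The arithmetic behind the correction is short enough to sketch. The family-wise error rate below assumes the comparisons are independent. That is a simplification, because variants tested against a shared control are correlated, so treat the uncorrected figure as an approximation. Both function names are made up for illustration.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_comparisons


def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


n_variants = 4  # four variants, each compared against the same control
uncorrected = familywise_error_rate(0.05, n_variants)
threshold = bonferroni_threshold(0.05, n_variants)
corrected = familywise_error_rate(threshold, n_variants)

print(f"per-comparison threshold: {threshold}")
print(f"chance of any false positive at 0.05 each: {uncorrected:.1%}")
print(f"chance of any false positive at {threshold} each: {corrected:.1%}")
```

Under the independence assumption, testing four variants at 0.05 each gives close to a one-in-five chance of at least one false winner. The corrected threshold brings that back under 5%.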