P-Values

Q: What does p < 0.05 actually mean?

It means that if the variant truly had no effect, you'd see a result this extreme or more extreme less than 5% of the time by random chance. It does not mean there is a 95% chance the variant is better — that's a different quantity (a Bayesian posterior).

Q: Is a p-value the probability that the variant wins?

No. It's the probability of the data given the null hypothesis, not the probability of the hypothesis given the data. Conflating the two is the single most common error in CRO interpretation.

Q: Why is 0.05 the standard threshold?

It's a convention popularised by R.A. Fisher in the 1920s, not a mathematical truth. Some teams use stricter cutoffs (0.01) for irreversible shipping decisions and looser ones (0.10) for low-risk iterative changes.

Q: Can I stop a test as soon as p drops below 0.05?

No — that's called peeking, and it dramatically inflates your false-positive rate. If you peek daily without correction, a test designed for 5% false positives can deliver 20-30%. Decide your sample size up front and wait.
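
The inflation is easy to see in a simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so every "significant" result is a false positive) and compares a daily peeker with someone who only checks on the final day. The traffic numbers, the 10% base rate, the 14-day duration, and the function names are assumptions picked for illustration; the exact rates will vary with those choices.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in the data: no evidence either way
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positive_rates(n_tests=500, days=14, visitors_per_day=100,
                                  rate=0.10, alpha=0.05, seed=7):
    """Run A/A tests (no true effect) and compare two stopping rules."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed_threshold = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
            n += visitors_per_day
            if two_sided_p(conv_a, n, conv_b, n) < alpha:
                crossed_threshold = True  # a daily peeker would have stopped here
        peeking_hits += crossed_threshold
        # Fixed horizon: only the final-day p-value counts.
        fixed_hits += two_sided_p(conv_a, n, conv_b, n) < alpha
    return peeking_hits / n_tests, fixed_hits / n_tests


peeking_rate, fixed_rate = simulate_false_positive_rates()
print(f"false positives with daily peeking: {peeking_rate:.1%}")
print(f"false positives at fixed horizon:   {fixed_rate:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate sits well above it, because every extra look is another chance for noise to cross the threshold.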

Q: What's the difference between a p-value and statistical significance?

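Statistical significance is the binary label you assign after comparing the p-value to a threshold (e.g. p < 0.05 = "significant"). The p-value is the underlying continuous number. The label is a coarser summary of the same test: p = 0.049 and p = 0.0001 both count as "significant", so report the p-value itself alongside the label.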

Q: How is a p-value related to a confidence interval?

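They're two views of the same test. If the 95% confidence interval for the difference excludes zero, the two-sided p-value is below 0.05 (right at the boundary the two can disagree slightly when the test and the interval use different standard-error formulas). Confidence intervals are usually more useful because they show effect size, not just whether something is non-zero.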
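
As a rough sketch, a 95% interval for the difference in conversion rates can be computed as below. It uses the simple Wald interval with an unpooled standard error, which is one common choice among several; the function name and the visitor counts are hypothetical.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for (variant rate - control rate)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error: each arm keeps its own variance estimate.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical counts, chosen only for illustration.
low, high = diff_confidence_interval(conv_a=500, n_a=10_000, conv_b=580, n_b=10_000)
print(f"95% CI for the lift: [{low:+.4f}, {high:+.4f}]")
print("excludes zero" if low > 0 or high < 0 else "includes zero")
```

The interval answers "how big might the lift plausibly be?", which a bare p-value cannot.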

Q: What's a one-tailed vs two-tailed p-value?

A two-tailed test checks for a difference in either direction (the variant could be better or worse) and is the default in CRO. A one-tailed test checks one direction only, which halves the p-value when the observed effect points in that direction. Use it only when a worse outcome is truly impossible, which is rare.
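
A small sketch of the relationship, assuming you already have a z-score and that the one-tailed hypothesis is "variant is better" (z > 0). The function name is illustrative.

```python
from statistics import NormalDist


def tail_p_values(z):
    """Return (two_tailed, one_tailed_upper) p-values for a z-score.

    The one-tailed value tests only the direction "variant is better" (z > 0).
    """
    upper_tail = 1 - NormalDist().cdf(z)
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return two_tailed, upper_tail


for z in (1.8, -1.8):
    two, one = tail_p_values(z)
    print(f"z = {z:+.1f}: two-tailed p = {two:.3f}, one-tailed p = {one:.3f}")
```

When the variant does worse (negative z), the one-tailed p-value is close to 1, so a one-tailed test can never flag a harmful variant as significant.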

Q: Why did my p-value get smaller as the test ran longer?

With a true underlying lift, p-values shrink as sample size grows because you're accumulating evidence. With no real lift, p-values bounce around randomly. This is why pre-committed sample size matters.

Q: Should I use Bayesian methods instead?

Bayesian A/B testing gives you the intuitive answer most stakeholders actually want — "probability variant beats control" and "expected loss if we ship the wrong one." Frequentist p-values remain the industry default, but Bayesian is increasingly common in modern testing tools.
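
A minimal sketch of how those two numbers can be estimated, assuming a uniform Beta(1, 1) prior on each arm's conversion rate and Monte Carlo draws from the resulting Beta posteriors. The prior, the draw count, the function name, and the visitor counts are illustrative assumptions, not the method of any particular testing tool.

```python
import random


def bayesian_ab(conv_a, n_a, conv_b, n_b, draws=20_000, seed=11):
    """Monte Carlo estimate of P(variant > control) and expected loss.

    Assumes a uniform Beta(1, 1) prior on each conversion rate.
    """
    rng = random.Random(seed)
    wins = 0
    loss_if_ship_b = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
        # Conversion rate given up if we ship B but A was actually better.
        loss_if_ship_b += max(rate_a - rate_b, 0.0)
    return wins / draws, loss_if_ship_b / draws


# Hypothetical counts, chosen only for illustration.
prob_b_better, expected_loss = bayesian_ab(500, 10_000, 580, 10_000)
print(f"P(variant beats control) = {prob_b_better:.1%}")
print(f"expected loss if we ship the variant = {expected_loss:.6f}")
```

Unlike a p-value, the first number really is a statement about the variant, but it depends on the prior you chose.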

Q: What p-value do I need for a multi-variant test?

Lower than 0.05 per comparison, because running multiple variants inflates the chance one looks significant by accident. A Bonferroni correction (divide your threshold by the number of comparisons against control) is the simplest fix; for 4 variants each compared with control, that's 0.05 / 4, or p < 0.0125.
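
The arithmetic can be sketched as follows. The family-wise error formula treats the comparisons as independent, which is a simplification (variants that share one control are correlated), and the function names are illustrative.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


uncorrected = familywise_error_rate(0.05, 4)
threshold = bonferroni_threshold(0.05, 4)
corrected = familywise_error_rate(threshold, 4)
print(f"4 comparisons at 0.05 each: {uncorrected:.1%} chance of a false winner")
print(f"Bonferroni threshold: {threshold}")
print(f"4 comparisons at the corrected threshold: {corrected:.1%}")
```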

Metricuno

May 19, 2026

Quick answer

A p-value is the probability of seeing your A/B test result (or something more extreme) if the variant actually did nothing. Here's how to read it correctly.

Definition

Statistical Analysis

P-Value

The probability of observing your test result, or something more extreme, if the variant actually had no effect.

A p-value is a conditional probability produced by a statistical test. It answers a narrow question: assuming the null hypothesis is true — that your variant and control perform identically — how often would random sampling alone produce a difference at least as large as the one you saw?

Small p-values mean the observed gap is unlikely under that no-effect assumption, which is why teams treat them as evidence against the null. They do not tell you the probability that your variant wins, the size of the lift, or how confident you should be in shipping. Reading any of those into a p-value is the single biggest source of misreads in CRO.

Also known as

p-value

observed significance level

In an A/B test, the p-value is the output of a significance test (usually a two-sample z-test or t-test on conversion rate) run against the data you collected. A p of 0.03 means: if the variant truly did nothing, you would see a difference this large or larger in roughly 3 out of every 100 tests by chance alone.

The conventional cutoff in CRO is p < 0.05, inherited from frequentist statistical analysis. That threshold is a convention, not a law of physics — and it does not adjust for peeking, multiple variants, or the business cost of a wrong call. Treat it as one input into a shipping decision, not the decision itself.

Formula

p = 2 * (1 - Φ(|z|)) where z = (p_b - p_a) / sqrt( p_pool * (1 - p_pool) * (1/n_a + 1/n_b) )

Variables

p_a

Control conversion rate

Observed conversion rate in the control group (A).

p_b

Variant conversion rate

Observed conversion rate in the variant group (B).

n_a

Control sample size

Number of visitors assigned to the control.

n_b

Variant sample size

Number of visitors assigned to the variant.

p_pool

Pooled conversion rate

Combined conversion rate across both groups: (conversions_a + conversions_b) / (n_a + n_b).

z

Z-score

Standardised distance between the two conversion rates.

Φ

Standard normal CDF

Cumulative distribution function of the standard normal.

Worked example

An apparel Shopify store tests a new product-page layout. Control: 12,000 visitors, 360 add-to-carts (3.00%). Variant: 12,000 visitors, 432 add-to-carts (3.60%).

p_a: 0.0300

p_b: 0.0360

n_a: 12000

n_b: 12000

p_pool: 0.0330

→ z ≈ 2.60, two-sided p ≈ 0.0093

Under the null hypothesis of no real difference, you'd see a gap this large or larger about 1% of the time by chance. That is below the 0.05 threshold, so most teams would call the variant a winner, assuming the test ran to its pre-planned sample size.
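
The worked example above can be reproduced with a few lines of standard-library Python. This is a sketch of the pooled two-proportion z-test from the Formula section, not the code of any specific testing tool; the function name is an assumption.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two_sided_p)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under the null (both arms share p_pool).
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Inputs from the worked example: 360/12,000 vs 432/12,000 add-to-carts.
z, p_value = two_proportion_z_test(360, 12_000, 432, 12_000)
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

Passing raw conversion counts instead of pre-rounded rates avoids small rounding drift in the result.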

Two practical notes. First, the formula above is for binary conversion outcomes; revenue-per-visitor or AOV tests typically use a t-test on continuous data and produce a different p-value. Second, with any real underlying lift, however tiny, the p-value gets smaller as your sample grows, which is why effect size and confidence interval matter as much as the p-value itself.
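
To illustrate the second note, the sketch below holds a small observed lift fixed and only changes the sample size. The rates (3.0% vs 3.1%), the sample sizes, and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def p_value_for_rates(rate_a, rate_b, n_per_arm):
    """Two-sided pooled z-test p-value when both arms have n_per_arm visitors
    and the observed rates are exactly rate_a and rate_b."""
    pooled = (rate_a + rate_b) / 2  # equal arms, so the pooled rate is the mean
    se = math.sqrt(pooled * (1 - pooled) * (2 / n_per_arm))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical rates: a 0.1-point absolute lift (3.0% -> 3.1%) held fixed.
sample_sizes = [10_000, 100_000, 1_000_000]
p_values = [p_value_for_rates(0.030, 0.031, n) for n in sample_sizes]
for n, p in zip(sample_sizes, p_values):
    print(f"n per arm = {n:>9,}: observed lift = 0.10 pts, p = {p:.4f}")
```

The lift is identical in every row; only the p-value changes. A very small p-value therefore says the effect is probably real, not that it is large enough to matter.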

Benchmark

How CRO teams typically read p-values in A/B tests

P-value range	Conventional label	Typical CRO action	Caveat
p < 0.01	Highly significant	Ship the variant	Still check effect size and segment stability
0.01 ≤ p < 0.05	Significant	Ship if test reached planned sample size	Peeking inflates false positives — don't stop early
0.05 ≤ p < 0.10	Marginal / trending	Extend the test or iterate	Common in underpowered tests with <10k visitors per arm
0.10 ≤ p < 0.20	Inconclusive	No ship; revisit hypothesis	Often the variant is genuinely flat
p ≥ 0.20	No evidence	Kill the variant	Treat as a learning, not a failure
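
If you want the table's bands in a report or dashboard, a simple lookup is enough. This sketch copies the labels and actions from the table verbatim; the function name is an assumption, and the caveats column still applies to whatever it returns.

```python
def read_p_value(p):
    """Map a p-value to the conventional label and action in the table above."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    # (exclusive upper bound, label, typical action), checked in order.
    bands = [
        (0.01, "Highly significant", "Ship the variant"),
        (0.05, "Significant", "Ship if test reached planned sample size"),
        (0.10, "Marginal / trending", "Extend the test or iterate"),
        (0.20, "Inconclusive", "No ship; revisit hypothesis"),
    ]
    for upper_bound, label, action in bands:
        if p < upper_bound:
            return label, action
    return "No evidence", "Kill the variant"


label, action = read_p_value(0.03)
print(f"p = 0.03: {label} -> {action}")
```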

The most common misread: treating p = 0.03 as "there's a 97% chance the variant wins." That is not what it says. The p-value conditions on the null being true; it does not give you the probability that the null is false. For a direct "probability variant beats control" reading, you want a Bayesian A/B test, which outputs exactly that quantity.
