Experimentation Benchmarks
Public CRO program data shows win rates of 15-25%, median effects of 2-6%, and 3-5 week runtimes. Use these benchmarks to set honest expectations.
Reference ranges for A/B test win rates, effect sizes, sample sizes, and runtime drawn from public CRO program data.
Experimentation benchmarks summarise how a typical conversion-rate-optimization program performs across four dimensions: the share of tests that beat control (win rate), how large the winning lifts tend to be (effect size), how much traffic each test consumed (sample size), and how long it ran (runtime). Public datasets from CRO tools and agencies put winning tests at roughly 15-25% of all decisions, with median lifts of 2-6% on the primary metric and runtimes of 3-5 weeks per test.
These ranges matter because they anchor what's actually achievable. A roadmap that assumes 50% win rates and 10% lifts will overpromise; a program tracked against the real distribution can plan sample sizes, set realistic ROI expectations, and avoid stopping tests early.
The single number people quote, "around 1 in 5 tests wins", is roughly right but hides most of the useful detail. Win rate climbs sharply with program maturity and hypothesis quality, and it also depends on how strictly "win" is defined (statistical significance, practical significance, or just directional): the looser the definition, the higher the reported rate.
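To see how far an optimistic roadmap drifts from a benchmark-based one, here is a minimal sketch. The function name, the 12-test quarter, and the 20% win rate with a 4% lift per win are illustrative assumptions chosen from inside the ranges above. The model is also simplified: wins are independent, every winner ships, each win delivers the same lift, and lifts compound.

```python
def expected_compound_lift(tests: int, win_rate: float, lift_per_win: float) -> float:
    """Expected cumulative relative lift across a batch of tests.

    Assumes independent tests where each one wins with probability
    `win_rate` and a win multiplies the metric by (1 + lift_per_win).
    Losers and inconclusive tests ship nothing. Under those assumptions
    the expected multiplier per test is 1 + win_rate * lift_per_win.
    """
    return (1 + win_rate * lift_per_win) ** tests - 1


# Hypothetical quarter of 12 tests.
optimistic = expected_compound_lift(tests=12, win_rate=0.50, lift_per_win=0.10)
benchmark = expected_compound_lift(tests=12, win_rate=0.20, lift_per_win=0.04)

print(f"Optimistic roadmap: {optimistic:.1%}")
print(f"Benchmark roadmap:  {benchmark:.1%}")
```

Under these assumptions the optimistic plan promises a lift several times larger than the benchmark-based plan, which is the gap a stakeholder will eventually notice.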
The numbers below pull from publicly reported aggregates across Shopify, WooCommerce, and Magento stores in the €1M-€15M revenue band — the segment where most CRO programs actually live. Treat them as orientation, not a target: your traffic mix, AOV, and test discipline shift every cell.
Typical experimentation program benchmarks by maturity level
| Metric | New program (yr 1) | Established (yr 2-3) | Mature (yr 4+) |
|---|---|---|---|
| Win rate (% of tests beating control at 95% sig.) | 10-15% | 18-22% | 22-28% |
| Median effect size on primary metric | +2-4% | +3-5% | +4-7% |
| Average sample size per variant | 35-60k | 50-90k | 80-150k |
| Median runtime per test | 21 days | 18 days | 14 days |
| Tests shipped per quarter | 3-6 | 8-14 | 15-25 |
| Share of tests with conclusive result | 55-65% | 70-80% | 80-90% |
Read the table as a trajectory, not a scoreboard. Year-one programs lose 35-45% of their tests to inconclusive results, and the fix is usually bigger swings on higher-traffic pages, not more tests. By year three, the bottleneck flips to hypothesis quality and prioritisation.
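If you want to place your own program in the table, the two share metrics can be tallied from a simple test log. This is a sketch only: the outcome labels `"win"`, `"loss"`, and `"inconclusive"` and the example log are hypothetical, and "win" here means whatever definition your program has agreed on.

```python
from collections import Counter


def program_summary(outcomes: list[str]) -> dict[str, float]:
    """Compute win rate and conclusive share from a list of test outcomes.

    Each outcome is one of "win", "loss", or "inconclusive"
    (hypothetical labels used for this sketch).
    """
    if not outcomes:
        raise ValueError("need at least one test outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "win_rate": counts["win"] / total,
        "conclusive_share": (counts["win"] + counts["loss"]) / total,
    }


# Hypothetical year-one log: 3 wins, 9 losses, 8 inconclusive tests.
log = ["win"] * 3 + ["loss"] * 9 + ["inconclusive"] * 8
print(program_summary(log))
```

The example log works out to a 15% win rate and a 60% conclusive share, which sits inside the year-one column.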
Win rate by hypothesis source
What actually drives win rate
Hypothesis source is the single biggest lever. Tests built from real funnel drop-off data and session replay win roughly 2-3x more often than tests copied from competitors or pulled from "best practice" listicles. That's why programs that start with a GA4 audit tend to outperform programs that start with a backlog of opinions.
The second lever is swing size. Tests that change one button colour rarely produce a 4% lift on revenue per visitor — the mechanism just isn't strong enough. Re-architecting a product detail page, restructuring checkout, or changing pricing presentation moves the needle harder, at the cost of more design effort and slightly more variance.
Watch for inflated win rates
If your program is reporting 40%+ win rates, check the definition. Common culprits: peeking and stopping early, using one-sided tests, counting micro-conversion lifts as wins, and not adjusting for multiple comparisons. A real 25% win rate at 95% significance beats a fake 50% every time, because the fake number doesn't survive contact with the bottom line.
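One of the culprits above, unadjusted multiple comparisons, is easy to illustrate. The sketch below applies Holm's step-down correction, which is one common adjustment and not necessarily the one your testing tool uses. The function name and the p-values are made up for illustration.

```python
def holm_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag which p-values remain significant after Holm's step-down correction.

    The smallest p-value is compared against alpha / m, the next against
    alpha / (m - 1), and so on. Testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    keep = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            keep[idx] = True
        else:
            break  # every larger p-value fails as well
    return keep


# Hypothetical test with four variants compared against control.
p_values = [0.004, 0.03, 0.04, 0.20]
naive_wins = sum(p < 0.05 for p in p_values)
adjusted_wins = sum(holm_significant(p_values))
print(naive_wins, adjusted_wins)
```

In this example three variants look like winners at an unadjusted 0.05 threshold, but only one survives the correction.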
Planning sample size and runtime
Most stores in this revenue band run 200k-800k monthly sessions, which gives you roughly 50k-150k visitors per variant in a two-week test. With a 2.5% baseline conversion rate, that's enough to detect a relative lift of around 8-12% at 80% power, meaning anything smaller will often look inconclusive even when it's a real win.
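You can sanity-check the detectable lift for your own traffic with a rough calculation. This sketch uses the standard normal approximation for comparing two proportions, with a two-sided test at 95% significance and 80% power, and it uses the baseline variance for both arms. Those choices and the function name are assumptions, and dedicated power calculators may give slightly different answers.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(baseline: float, n_per_variant: int,
                 alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate minimum detectable relative lift for a two-variant test.

    Normal approximation for two proportions, two-sided test, with the
    baseline variance used for both arms (a simplification).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    standard_error = sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    absolute_mde = (z_alpha + z_beta) * standard_error
    return absolute_mde / baseline


for n in (50_000, 150_000):
    print(f"{n:>7} visitors per variant: {relative_mde(0.025, n):.1%}")
```

With a 2.5% baseline, this approximation gives roughly 11% at 50k visitors per variant and roughly 6% at 150k, in the same neighbourhood as the range quoted above. Quadrupling traffic only halves the detectable lift, which is why small changes are so expensive to test.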
The practical implication: don't test changes you only expect to move conversion by 1-2% unless you have enterprise-level traffic. Either stack the change with other improvements into a bigger swing, or test on a higher-traffic surface like the homepage or PDP where the same relative lift produces more absolute revenue.
Frequently asked questions
For an established program, 18-25% of tests beating control at 95% significance is a healthy range. New programs typically sit at 10-15% in year one and improve as hypothesis quality and prioritisation tighten.
The two most common causes of a low win rate are weak hypotheses (copied from competitors or based on opinion) and under-powered tests (too few visitors or too small an expected lift). Audit your last 10 tests: how many were built from actual user-behaviour data?
Median winning effect sizes land between 2% and 6% relative lift on the primary metric. Lifts above 10% are rare and usually involve major UX changes like checkout restructuring or pricing presentation, not small copy or colour tweaks.
Two to four full weeks is the working norm for test runtime: long enough to cover at least one full business cycle, including weekends and any weekly email send. Tests stopped before 14 days frequently flip direction once novelty effects fade.
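A planned runtime can be derived from the required sample size and then rounded up to whole weeks. The sketch below is one way to do that. The function name, the 14-day floor as a default parameter, and the traffic figures are illustrative assumptions.

```python
from math import ceil


def planned_runtime_days(n_per_variant: int, variants: int,
                         daily_visitors: int, min_days: int = 14) -> int:
    """Days to run a test, rounded up to full weeks with a minimum floor.

    `daily_visitors` is the traffic entering the test per day across all
    variants (assumed constant, which real traffic is not).
    """
    raw_days = ceil(n_per_variant * variants / daily_visitors)
    full_weeks = ceil(max(raw_days, min_days) / 7)
    return full_weeks * 7


# Hypothetical test: 60k visitors per variant, 2 variants, 7k visitors a day.
print(planned_runtime_days(60_000, 2, 7_000))
# Hypothetical high-traffic test that would fill up in 4 days.
print(planned_runtime_days(20_000, 2, 10_000))
```

The first example needs 18 days of traffic and is rounded up to 21. The second would hit its sample size in 4 days but is still held to the 14-day floor.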
Mature programs ship 15-25 tests per quarter; year-one programs often manage 3-6. Velocity matters less than win-weighted lift — five well-built tests can produce more revenue than fifteen rushed ones.
The benchmarks are only partially portable. Win rates and effect sizes roughly carry over; sample-size and runtime numbers do not. Stores under 100k monthly sessions usually need to test bigger changes on higher-traffic pages, or accept that some tests stay inconclusive.
In year-one programs, 35-45% of tests end inconclusive, meaning neither variant clearly wins. In established programs (years 2-3) this falls to 20-30%, and in mature programs to 10-20%, mostly because teams get better at sizing tests to detect realistic effects on their actual traffic.
Mobile tests tend to show slightly lower absolute conversion rates but similar relative lifts. Effect sizes on mobile-specific UX changes (sticky CTAs, simplified forms) often run 1-2 percentage points higher than the same change on desktop.
Practical significance matters as well as statistical significance. A statistically significant 0.4% lift on a low-margin SKU may not pay back the engineering cost. Set a minimum detectable effect tied to revenue impact, and only count tests that clear both bars as wins for ROI reporting.
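The "both bars" rule can be written down as a small check. This is a sketch under stated assumptions: a two-sided pooled z-test for two proportions stands in for whatever statistics your testing tool runs, the observed point estimate of the lift is compared directly with the threshold, and the function name, counts, and thresholds are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def is_roi_win(conv_a: int, n_a: int, conv_b: int, n_b: int,
               min_relative_lift: float, alpha: float = 0.05) -> bool:
    """True only if variant B is statistically AND practically significant.

    Assumes the control (A) has at least one conversion, so the relative
    lift is defined.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_lift = (rate_b - rate_a) / rate_a
    return p_value < alpha and relative_lift >= min_relative_lift


# Hypothetical result: 2.5% control vs 2.75% variant, 100k visitors each.
print(is_roi_win(2_500, 100_000, 2_750, 100_000, min_relative_lift=0.03))
print(is_roi_win(2_500, 100_000, 2_750, 100_000, min_relative_lift=0.15))
```

The same result counts as a win against a 3% minimum lift but not against a 15% one, even though it is statistically significant in both cases.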
Experimentation benchmarks focus on the testing program itself — win rates, runtime, sample size. Broader CRO benchmarks cover the underlying conversion rates, AOV, and funnel metrics those tests aim to improve. Use them together when scoping a roadmap.