How to Avoid A/B Testing Mistakes

Metricuno

May 19, 2026

8 min read

Quick answer

A field guide to the recurring A/B testing mistakes that produce false wins and shipped losers: peeking, underpowered tests, novelty effects, and segment-confounded interpretation — with concrete fixes for each.

Definition

Experimentation

A/B Testing Mistakes

The recurring methodological errors — peeking, low power, novelty effects, segment confounds — that produce false A/B test wins and shipped losers.

A/B testing mistakes are the procedural and statistical errors that make experiment results unreliable even when the tooling worked correctly. The damage isn't loud: tests look significant, dashboards show lifts, and teams ship variants — but the lift doesn't reproduce in revenue.

The usual suspects fall into four families. Statistical errors (peeking, stopping early, ignoring multiple comparisons) inflate false positives. Design errors (low power, contaminated audiences, novelty exposure) bias the measurement. Interpretation errors (segment cherry-picking, ignoring guardrails) turn noise into narrative. Process errors (no pre-registration, no holdouts, no post-test review) let bad habits compound across an experimentation program.

Also known as

A/B test errors

experimentation pitfalls

false positive A/B tests

Most teams running A/B testing think their failure mode is "not enough tests." It usually isn't. The harder failure is that a meaningful share of declared winners — industry estimates put it between 30% and 50% — don't replicate when re-run or held out.

That gap is almost always a methodology problem, not a tooling problem. The good news: the mistakes are well-known and finite. The list below is what we see most often when we audit experimentation programs on Shopify and WooCommerce stores in the €1M–€15M revenue band.

Statistical mistakes that inflate false positives

Peeking is the most common offender. Every time you check a running test and decide whether to stop based on the current p-value, you're effectively running multiple tests. A test designed for a 5% false-positive rate, checked daily for two weeks and stopped at the first "significant" reading, will declare a false winner roughly 20–25% of the time under the null.

Stopping early compounds it. A variant jumps ahead on day three because of a Tuesday email campaign, the team calls it, and ships. Two weeks later the lift evaporates. The fix is straightforward: commit to a sample size before the test starts, or use a sequential testing method (mSPRT, Bayesian with stopping rules) that explicitly accounts for repeated looks.

The third statistical landmine is multiple comparisons. If you test five variants against a control, or evaluate the same test against eight metrics, your effective false-positive rate balloons. Five variants at α=0.05 gives roughly a 23% chance of at least one false win. Apply a Bonferroni or Holm correction, or use a hierarchical Bayesian model.
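The arithmetic behind that figure, and the Holm step-down procedure, fit in a few lines. This is a minimal sketch that assumes the comparisons are independent (correlated metrics inflate the error rate less); the function names are ours, not from any particular testing tool.

```python
def familywise_error_rate(comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni: return True for each p-value whose null is rejected."""
    m = len(p_values)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens at each step: alpha/m, alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: once one comparison fails, the rest fail too
    return reject


if __name__ == "__main__":
    print(round(familywise_error_rate(5), 3))   # five variants vs control
    print(holm_reject([0.004, 0.03, 0.02, 0.20, 0.011]))
```

With the illustrative p-values above, a plain Bonferroni threshold of 0.05 / 5 = 0.01 would keep only the 0.004 result, while Holm also keeps 0.011. That is the sense in which Holm is the slightly more powerful of the two.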

The peeking tax

Checking a test 10 times during its run, and stopping at the first significant reading, roughly quadruples your false-positive rate (from 5% to about 19%) versus checking only at the pre-committed end. If you can't resist looking, at least don't let early peeks influence the stop decision: log them and ignore them until the planned sample is reached.
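To see the peeking tax for yourself, a small simulation helps. The sketch below is a simplified model, not a replay of any real test: it assumes every check adds one more equal-sized batch of data, that there is no true difference between the arms, and that the running z-score behaves like a cumulative sum of standard-normal draws. The function name and defaults are illustrative.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks, alpha=0.05, simulations=20000, seed=7):
    """Share of null (A/A) tests stopped as 'significant' at any interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    false_wins = 0
    for _ in range(simulations):
        cumulative = 0.0
        for k in range(1, looks + 1):
            # One more batch of data arrives; under the null it is pure noise.
            cumulative += rng.gauss(0.0, 1.0)
            z_score = cumulative / k ** 0.5
            if abs(z_score) > z_crit:
                false_wins += 1  # the team "calls it" at this peek
                break
    return false_wins / simulations


if __name__ == "__main__":
    for looks in (1, 10, 14):
        print(looks, peeking_false_positive_rate(looks))
```

With a single look the rate stays near the nominal 5%; with 10 or 14 looks it climbs well above it, because any one significant reading is allowed to end the test.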

Design mistakes that bias the measurement

Underpowered tests are the silent killer. A store doing 80,000 sessions a month with a 2.3% conversion rate needs roughly 70,000 visitors per variant to detect a 10% relative lift at 80% power and a two-sided α of 0.05. For a two-arm test that is seven to eight weeks of traffic, even if every session enters the test. Most teams run shorter, smaller tests and then wonder why "flat" results keep happening. They're not flat, they're invisible against the noise floor.
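A sample size estimate like this is worth recomputing for your own baseline rather than taking on trust. The sketch below uses the standard normal-approximation formula for comparing two proportions; the two-sided α of 0.05 and 80% power are assumed defaults, and the function name is ours.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in each arm of a two-arm conversion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)       # rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # The store from the example: 2.3% baseline, 10% relative lift.
    print(visitors_per_variant(0.023, 0.10))
    # Doubling the target lift cuts the requirement to roughly a quarter.
    print(visitors_per_variant(0.023, 0.20))
```

The result is sensitive to the power and significance level you choose, so state those alongside any sample size you commit to.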

Novelty effects bias the other direction. A radically restyled PDP can get a 6% lift in week one purely because returning visitors notice the change and click around. By week three the lift is 1%. If your test ran for ten days, you shipped the week-one number. Always include at least one full purchase cycle and segment new vs returning visitors in the readout.

Chart: How an apparent A/B test lift decays as novelty wears off (series: returning visitors, new visitors)

Audience contamination is the other design trap. The same user gets variant A on mobile and variant B on desktop because the assignment key is the session, not the user. Email campaigns land in the middle of the test and skew the traffic mix. Always assign on a persistent ID (a Shopify customer ID, or a long-lived cookie when the visitor isn't logged in, bearing in mind that a cookie only persists within one browser), and pause major paid pushes during the run if possible.
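One common way to make assignment stick to the user rather than the session is to derive the variant from a hash of the persistent ID. The sketch below is an illustration under assumptions: the `user_id` and `experiment` strings are hypothetical, and a real platform will have its own bucketing logic.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "variant")):
    """Deterministically map a persistent user ID to one variant."""
    # Salting with the experiment name keeps buckets independent across tests.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    # The same ID always lands in the same arm, whatever the device or session.
    print(assign_variant("customer-1042", "pdp-restyle"))
    print(assign_variant("customer-1042", "pdp-restyle"))
```

Because the mapping is a pure function of the ID, no assignment state has to be stored, and the split can be re-derived later when auditing a readout.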

Interpretation mistakes that turn noise into narrative

Segment cherry-picking is the most seductive interpretation error. The overall test is flat, but the variant won by 14% on mobile Safari users in Germany, so you ship it for that segment. The problem: if you slice your test into 20 segments at α=0.05, you should expect about one of them to look significant by chance alone. Pre-register the segments you'll examine, or apply a correction across all post-hoc slices.

Ignoring guardrail metrics is the other common failure. The variant lifts add-to-cart by 8% but drops average order value by 5%. If only part of that add-to-cart lift carries through to completed orders, net revenue is flat or negative. Every test should declare a primary metric, two or three secondary metrics, and at least one revenue-per-visitor guardrail. Calling a winner on conversion rate alone is how stores end up shipping changes that hurt the P&L.
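The trade-off is easy to check by hand, since revenue per visitor is order conversion rate times average order value. The sketch below is plain arithmetic; how much of an add-to-cart lift carries through to orders is an assumption you have to supply from your own funnel data.

```python
def revenue_per_visitor_change(order_conversion_lift, aov_change):
    """Relative change in revenue per visitor (RPV = conversion rate x AOV)."""
    return (1 + order_conversion_lift) * (1 + aov_change) - 1


if __name__ == "__main__":
    # Assumption: the full 8% add-to-cart lift becomes an 8% lift in orders.
    print(round(revenue_per_visitor_change(0.08, -0.05), 4))
    # Assumption: only half of it carries through to completed orders.
    print(round(revenue_per_visitor_change(0.04, -0.05), 4))
```

The second case comes out negative, which is exactly the kind of hidden loser a revenue-per-visitor guardrail is there to catch.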

Benchmark

Common A/B testing mistakes by frequency and revenue impact

Mistake	How often we see it	Typical false-win rate	Severity
Peeking + early stopping	70% of teams	25–35%	High
Underpowered tests	60% of teams	n/a (false negatives)	High
Ignoring novelty effects	55% of teams	20–30%	Medium
Segment cherry-picking	45% of teams	30–50%	High
No guardrail metrics	50% of teams	n/a (hidden losers)	High
Multiple comparison neglect	65% of teams	15–25%	Medium
Sample ratio mismatch ignored	40% of teams	varies	Critical
No post-test holdout	80% of teams	n/a (compounding error)	Medium

Sample ratio mismatch (SRM) deserves its own callout. If your 50/50 split is actually arriving as 48.2/51.8 across tens of thousands of visitors, something is wrong with assignment, tracking, or both, and the test result is unreliable regardless of the p-value. A chi-square check on the assignment ratio takes 30 seconds and should be the first thing you look at on any test readout.
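The chi-square check needs nothing beyond the two visitor counts. The sketch below is a minimal version using only the standard library; the function name is ours, and the strict alert threshold mentioned afterwards is a convention we are assuming, not a rule from any specific platform.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for the observed traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # With one degree of freedom the chi-square tail has this closed form.
    return erfc(sqrt(chi2 / 2))


if __name__ == "__main__":
    print(srm_p_value(48200, 51800))  # 48.2/51.8 on 100,000 visitors
    print(srm_p_value(482, 518))      # the same ratio on 1,000 visitors
```

The same 48.2/51.8 ratio is overwhelming evidence of a problem at 100,000 visitors and unremarkable at 1,000, which is why the check should be a test rather than an eyeball judgment. Teams often alert only on very small p-values (0.001 is a typical choice) to avoid chasing noise.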

Process mistakes that compound across a program

The slowest-burning mistakes are organisational. No pre-registered hypothesis means the team rationalises whatever the data shows — "we always thought mobile users would respond differently." No documented stopping criteria means tests run until someone is bored or impatient. No post-test holdout means you never learn whether shipped winners actually held up in production.

The fix is process, not tooling. A one-page test brief — hypothesis, primary metric, guardrails, sample size, planned duration, segments to be pre-registered — before any test starts. A standing 10% holdout that never sees shipped winners, so you can measure real cumulative lift quarterly. A post-mortem template for any test that surprised you, win or lose. This is the discipline that separates programs that compound from programs that thrash.

The 30-second test-validity check

Before you call any winner, answer these four: (1) Did the test reach its pre-committed sample size? (2) Is the assignment ratio within 1% of the intended split? (3) Did the primary metric move on both new and returning visitors? (4) Did revenue-per-visitor move in the same direction? Any "no" means you need to investigate before shipping.
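The four questions can be turned into a small pre-ship check. This is a sketch under stated assumptions: the `Readout` fields are hypothetical names for numbers your testing tool already reports, "within 1%" is read as one percentage point of traffic share, and "moved" is read as moving in the same direction.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    visitors_a: int
    visitors_b: int
    planned_per_variant: int  # pre-committed sample size per arm
    lift_primary: float       # overall lift on the primary metric
    lift_new: float           # primary-metric lift, new visitors
    lift_returning: float     # primary-metric lift, returning visitors
    lift_rpv: float           # revenue-per-visitor lift


def validity_issues(r, intended_share_a=0.5, tolerance=0.01):
    """Return the list of checks that failed; an empty list means none did."""
    issues = []
    if min(r.visitors_a, r.visitors_b) < r.planned_per_variant:
        issues.append("pre-committed sample size not reached")
    share_a = r.visitors_a / (r.visitors_a + r.visitors_b)
    if abs(share_a - intended_share_a) > tolerance:
        issues.append("assignment ratio outside tolerance")
    if r.lift_new * r.lift_returning <= 0:
        issues.append("new and returning visitors disagree on the primary metric")
    if r.lift_primary * r.lift_rpv <= 0:
        issues.append("revenue per visitor did not move with the primary metric")
    return issues


if __name__ == "__main__":
    readout = Readout(30000, 33000, 70000, 0.02, 0.05, -0.01, -0.01)
    for issue in validity_issues(readout):
        print("investigate:", issue)
```

An empty list does not prove the result is real; it only means none of the four quick checks gave a reason to hold the ship decision.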

Frequently asked

Frequently asked questions about A/B testing mistakes

What is the most common A/B testing mistake?

Peeking, meaning checking a test in flight and stopping when it crosses significance, is the most common and most damaging. It can inflate the false-positive rate from 5% to 20–30% or more, depending on how often you peek. Commit to a sample size up front, or switch to a sequential testing method that accounts for repeated looks.

How long should an A/B test run to avoid these mistakes?

At least one full purchase cycle (typically 14–21 days for a store), and long enough to reach your pre-committed sample size. Running a test for a fixed week regardless of traffic almost guarantees you're either underpowered or peeking. Calculate sample size from your baseline conversion rate, MDE, and traffic, then commit.

What's the difference between a false positive and a novelty effect?

A false positive is a random fluctuation that looks like a lift. A novelty effect is a real but temporary behavior change: returning users react to the change itself, not the underlying improvement. Both produce inflated early readings, but the fixes differ: corrections and sample size for false positives, longer runtime and new-vs-returning segmentation for novelty.

Do I need to correct for multiple comparisons on every test?

Yes, anytime you're testing more than two variants or evaluating more than one primary metric. A Bonferroni correction (divide α by the number of comparisons) is the simplest. Holm-Bonferroni is slightly more powerful. If you're running a Bayesian framework, a hierarchical prior handles it more naturally.

What is sample ratio mismatch and why does it matter?

Sample ratio mismatch (SRM) is when the actual traffic split deviates significantly from the intended one, say 48/52 instead of 50/50. It signals a tracking or assignment bug, which means the comparison itself may be biased. A simple chi-square test should run on every readout; if it flags SRM, the result is not interpretable until you find the cause.

How do I know if a test is underpowered?

Run a sample size calculation before launch: input your baseline conversion rate, the minimum detectable effect you care about, and your desired power (usually 80%). If the required sample size is more than 4–6 weeks of traffic at your current volume, the test is underpowered for that MDE. Either pick a bigger swing to test or accept that smaller effects will be invisible.

Is it ever okay to slice an A/B test by segment after the fact?

Yes, for hypothesis generation, but not for ship decisions. Post-hoc segment findings should feed the next test's pre-registered hypothesis, not justify shipping the current variant to that segment. The discipline: pre-register the 2–3 segments you'll examine before launch, and treat anything else as exploratory.

What guardrail metrics should every test include?

At minimum: revenue per visitor (catches AOV trade-offs), page load time (catches technical regressions), and bounce rate or session depth (catches UX damage). For checkout tests, add refund rate and customer service contact rate measured 30 days post-test, because some "wins" surface as costs later.

How do I run A/B tests on low-traffic stores without these mistakes?

Test bigger swings (target 15–20% MDE instead of 5%), test higher in the funnel where volume is larger, batch micro-changes into themed variants, and accept slower velocity. A €1M store running 4 well-powered tests a year beats one running 40 underpowered tests. The latter is noise generation, not learning.

Should I always use a holdout group after shipping a winner?

If your platform supports it, yes. A standing 5–10% holdout that never receives shipped winners lets you measure cumulative lift quarterly against a true counterfactual. This is the only reliable way to catch shipped losers that slipped through and to validate that your program is actually moving the business.

Test ideas before you ship them

Run unlimited A/B tests, attach hypotheses to outcomes, and build a searchable archive of what works and what doesn't.

Launch your first experiment