A/B Testing
A working framework for A/B testing in online retail — how to design tests that survive scrutiny, run them long enough to trust, and decide without fooling yourself.
A randomized split test that compares two versions of a page or flow to decide which one performs better on a chosen metric.
A/B testing is the controlled experiment behind every serious conversion program. Visitors are randomly assigned to a control (A) or a variant (B), each sees a different version of the page, element, or flow, and a primary metric — usually conversion rate, revenue per visitor, or add-to-cart rate — decides the winner.
The method works because randomization balances confounders across both groups on average: traffic source, device, day of week, weather, promotions. With a large enough sample and a pre-declared metric, a measured difference can be attributed to the change you made rather than to noise, within the error rates you chose up front. It's the difference between knowing the new product page lifts revenue 4.1% and hoping it does.
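One common way to implement random assignment is to hash a stable visitor identifier, so the same visitor always lands in the same group. The sketch below is a minimal illustration under that assumption; `assign_variant`, the visitor ID format, and the experiment name are hypothetical and not taken from any particular testing tool.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically map a visitor to 'A' (control) or 'B' (variant).

    Hashing the experiment name together with the visitor ID keeps each
    visitor's assignment stable across visits and independent across
    experiments.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "A" if bucket < split else "B"


# Usage: the split should come out close to 50/50 over many visitors.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}", "pdp-size-selector")] += 1
print(counts)
```

Assignment depends only on the hash, not on traffic source, device, or time of day, which is what lets those factors even out between the two groups.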
A/B testing sits inside the broader practice of experimentation, but it's the workhorse variant — cheaper to run than a multivariate test, easier to interpret than a holdout, and the format every major testing tool is built around. If you only run one type of experiment, this is it.
This page is the framework view: how the pieces fit. For the strict definition see what is A/B testing, for tactics see the A/B testing process, for shortlists of vendors see A/B testing tools, and for concrete inspiration see A/B testing examples. The rest of this page covers the three phases that decide whether your program produces real lift or theatre — design, run, decide.
Phase 1 — Design: hypothesis, metric, sample
A test starts with a hypothesis in the form: "If we change X, then Y will improve, because Z." The Z matters most. "Because session recordings show 38% of mobile users scroll past the size selector" is a hypothesis. "Because the CEO prefers blue" is not. Weak hypotheses are the single biggest predictor of flat tests.
Then pick one primary metric and freeze it before the test starts. For a product detail page that's usually add-to-cart rate or revenue per visitor; for a checkout step it's completion rate. Picking the metric after you've seen the data is how teams accidentally ship variants that hurt revenue while "winning" on bounce rate. The A/B testing framework guide goes deeper on metric selection.
Phase 2 — Run: traffic, duration, discipline
Once the test is live, two numbers matter: how much traffic each variant has seen, and how long the test has been running. You need both. Hitting your sample-size target in 36 hours doesn't mean you can call the test — you'll have under-sampled Sundays, returning customers, and the post-payday spike. A minimum of two full weeks is the working rule for most online stores.
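The two conditions (enough traffic and enough calendar time) can be combined into one planning number. This is a rough sketch, assuming evenly split traffic and a hypothetical helper name `planned_duration_days`; the 14-day floor is the working rule from the paragraph above, and rounding up to whole weeks is an added assumption so every weekday is sampled equally.

```python
import math


def planned_duration_days(
    required_per_variant: int,
    daily_visitors: int,
    variants: int = 2,
    min_days: int = 14,
) -> int:
    """Days to run: long enough for the sample AND at least `min_days`."""
    days_for_sample = math.ceil(required_per_variant * variants / daily_visitors)
    days = max(days_for_sample, min_days)
    # Round up to whole weeks so each day of the week is covered equally.
    return math.ceil(days / 7) * 7


# A high-traffic store reaches its sample in about 2 days but still runs 14.
print(planned_duration_days(required_per_variant=9_000, daily_visitors=12_000))
# A lower-traffic store is limited by sample size instead.
print(planned_duration_days(required_per_variant=24_000, daily_visitors=1_500))
```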
Discipline during the run is mostly about what you don't do: don't peek at the results and stop early when the variant looks ahead, don't add a second variant mid-test, don't change the creative because the founder wants to. Each of these breaks the statistical guarantees the test depends on. The A/B testing mistakes page catalogues the rest.
The peeking problem
Checking results daily and stopping the moment p < 0.05 inflates your false-positive rate from the advertised 5% to roughly 20-30% over two to four weeks of daily checks. Even with no real difference between the variants, about one test in four would still produce a "winner". Either pre-commit to a fixed sample size and check only at the end, or use a sequential testing method (mSPRT, or a Bayesian approach with a pre-set stopping rule) that's designed for repeated looks.
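The inflation is easy to reproduce with a small simulation of A/A tests, where both arms share the same true conversion rate and any "winner" is by definition a false positive. The sketch below is illustrative only: the traffic figures, the 14-day horizon, and the function names are assumptions, and daily conversions are drawn from a normal approximation to the binomial to keep it fast.

```python
import math
import random


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b - conv_a) / n / se


def simulate(runs=2000, days=14, daily_n=1000, rate=0.03, seed=7):
    """Return (false-positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    mean = daily_n * rate
    sd = math.sqrt(daily_n * rate * (1 - rate))
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        ever_significant = False
        for day in range(1, days + 1):
            # Normal approximation to one day's conversions (assumption).
            conv_a += max(0, round(rng.gauss(mean, sd)))
            conv_b += max(0, round(rng.gauss(mean, sd)))
            if abs(z_stat(conv_a, conv_b, day * daily_n)) > 1.96:
                ever_significant = True  # a daily peeker would stop here
        peeking_hits += ever_significant
        # The fixed-horizon analyst looks exactly once, on the last day.
        fixed_hits += abs(z_stat(conv_a, conv_b, days * daily_n)) > 1.96
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate()
print(f"daily peeking: {peeking_rate:.1%}, single final look: {fixed_rate:.1%}")
```

The single-look rate should sit near the nominal 5%, while the peeking rate comes out several times higher, and it keeps growing as the number of looks increases.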
Phase 3 — Decide: read the result, then act on it
A test ends in one of three places: clear winner, clear loser, or inconclusive. Inconclusive is a common outcome and a useful one, because it tells you the change wasn't big enough to detect at your current traffic. The mistake is treating inconclusive as "ship it anyway, it didn't hurt." Inconclusive means you don't know.
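One way to make the three outcomes explicit is to look at the confidence interval for the difference in conversion rates: entirely above zero is a winner, entirely below is a loser, and anything straddling zero is inconclusive. The sketch below assumes a simple unpooled normal-approximation interval at 95% confidence; the function name `decide` and the example counts are hypothetical.

```python
import math


def decide(conv_a: int, n_a: int, conv_b: int, n_b: int, z_crit: float = 1.96):
    """Classify a finished test from the CI of (rate_B - rate_A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    low, high = diff - z_crit * se, diff + z_crit * se
    if low > 0:
        verdict = "winner"
    elif high < 0:
        verdict = "loser"
    else:
        verdict = "inconclusive"  # the interval includes zero: you don't know
    return verdict, (low, high)


# 3.0% vs 3.3% on 10,000 visitors each: the interval still spans zero.
print(decide(300, 10_000, 330, 10_000))
# 3.0% vs 4.0% on the same traffic: the interval is clearly above zero.
print(decide(300, 10_000, 400, 10_000))
```

Reporting the interval alongside the verdict also shows how large a loss an "inconclusive" result could still be hiding.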
For winners, document the lift, the segment it came from (often mobile-only or new-visitor-only), and feed it back into the next hypothesis. A mature A/B testing program management practice turns every result — win, loss, flat — into a learning the next test builds on. The A/B testing roadmap is how you sequence those bets across a quarter.
Chart: visitors per variant needed to detect a given relative lift (baseline 3% conversion, 80% power, 95% confidence).
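The chart's values are not reproduced in this text, but figures of this kind can be approximated with the standard two-proportion sample-size formula. The sketch below assumes a two-sided test and the unpooled variance form; commercial calculators may use slightly different formulas and return somewhat different numbers. The function name `visitors_per_variant` is illustrative.

```python
import math
from statistics import NormalDist


def visitors_per_variant(
    baseline: float,
    relative_lift: float,
    power: float = 0.80,
    alpha: float = 0.05,
) -> int:
    """Approximate visitors needed per variant to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


for lift in (0.05, 0.10, 0.15, 0.20, 0.25):
    needed = visitors_per_variant(0.03, lift)
    print(f"{lift:.0%} relative lift: {needed:,} visitors per variant")
```

Because the required sample grows with the inverse square of the lift, halving the lift you want to detect roughly quadruples the traffic you need.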
A/B testing FAQ
How long should an A/B test run?
At minimum two full business cycles, which for an online store usually means two weeks, so you cover both weekends, mid-week patterns, and any returning-customer behaviour. Stopping sooner than that, even at statistical significance, risks calling noise a winner because you under-sampled certain traffic types.
How much traffic do I need to run an A/B test?
It depends on your baseline conversion rate and the lift you want to detect. At a 3% baseline, 80% power, and 95% confidence, detecting a 15% relative lift takes roughly 24,000 visitors per variant, and a 25% lift roughly 9,000. Below 10,000 monthly visitors per variant, only large redesigns (>25% lift) are practical to measure in a month-long test. Use a sample-size calculator before designing the test.
What is the difference between A/B testing and multivariate testing?
A/B testing compares two complete versions head-to-head. Multivariate testing (MVT) tests multiple element combinations at once (say, three headlines × two images × two CTAs, which is 12 combinations) and isolates the contribution of each. MVT needs roughly N× the traffic, where N is the number of combinations, so it's rarely practical below 100k monthly visitors.
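The combination count grows multiplicatively, which is where the traffic requirement comes from. The sketch below enumerates the cells of the example above; the element labels are placeholders, and `total_visitors` assumes every cell needs the same sample as one arm of an A/B test, which is a simplification.

```python
from itertools import product

headlines = ["headline-1", "headline-2", "headline-3"]
images = ["image-1", "image-2"]
ctas = ["cta-1", "cta-2"]


def mvt_cells(*factors):
    """Every combination of one option per factor (a full-factorial design)."""
    return list(product(*factors))


def total_visitors(per_cell: int, n_cells: int) -> int:
    """Total traffic if each cell needs the same per-cell sample (assumption)."""
    return per_cell * n_cells


cells = mvt_cells(headlines, images, ctas)
print(len(cells))  # 3 * 2 * 2 = 12 combinations
print(total_visitors(per_cell=9_000, n_cells=len(cells)))
```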
Can I run multiple A/B tests at the same time?
Yes, if they're on different pages or different audiences. Running two tests on the same page risks interaction effects: the variants influence each other and you can't cleanly attribute lift. Most teams sequence tests on the same page and parallelize across the funnel (PDP, cart, checkout).
What confidence level should I use?
95% confidence (p < 0.05) is the default and works for most commercial decisions. Some teams drop to 90% for low-risk changes to ship faster, and raise to 99% for irreversible decisions like checkout redesigns. Whatever you pick, set it before the test, not after seeing the data.
Can I run A/B tests on a low-traffic store?
Yes, but with adjusted expectations. Below 10k monthly sessions, restrict tests to high-impact changes (full PDP redesign, new checkout flow, hero block) and accept that you'll only catch lifts above ~20%. Small button-colour tests are mathematically hopeless at that traffic level, so focus on bigger swings.
Should I test mobile and desktop separately?
Always analyse the result by device, even if you run one combined test. The two audiences behave differently enough that a desktop winner is sometimes a mobile loser. If the difference between segments is large and traffic allows, run separate tests so you can ship the right variant to the right device.
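A per-device readout can be as simple as grouping the raw counts before computing lift. The sketch below uses made-up numbers purely for illustration, and the row layout and function name `lift_by_segment` are assumptions. Each segment has less traffic than the full test, so segment-level lifts are noisier than the combined result.

```python
from collections import defaultdict


def lift_by_segment(rows):
    """Relative lift of B over A per device, from per-cell visitor counts."""
    totals = defaultdict(lambda: {"A": [0, 0], "B": [0, 0]})
    for row in rows:
        cell = totals[row["device"]][row["variant"]]
        cell[0] += row["visitors"]
        cell[1] += row["conversions"]
    lifts = {}
    for device, arms in totals.items():
        rate_a = arms["A"][1] / arms["A"][0]
        rate_b = arms["B"][1] / arms["B"][0]
        lifts[device] = rate_b / rate_a - 1
    return lifts


# Hypothetical data: B looks better on desktop and worse on mobile.
rows = [
    {"device": "desktop", "variant": "A", "visitors": 5000, "conversions": 200},
    {"device": "desktop", "variant": "B", "visitors": 5000, "conversions": 240},
    {"device": "mobile", "variant": "A", "visitors": 5000, "conversions": 150},
    {"device": "mobile", "variant": "B", "visitors": 5000, "conversions": 135},
]
for device, lift in lift_by_segment(rows).items():
    print(f"{device}: {lift:+.1%}")
```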
What is a good A/B test win rate?
Mature programs see 15-25% of tests produce a shippable winner, 30-40% flat or inconclusive, and the rest negative. If your win rate is above 50%, you're probably calling tests too early or testing only obvious changes. A higher proportion of bold tests is healthier than a high "win rate" on small ones.
How does A/B testing relate to personalization?
Personalization is what you do after A/B testing reveals segment-level differences. The test tells you variant B wins overall but loses on returning customers; personalization ships variant A to returning customers and variant B to new ones. Without the test data first, personalization is just guessing in more places.
Do I need a developer to run A/B tests?
For simple copy, image, and layout changes, no. Modern A/B testing tools include visual editors that ship via a single snippet on Shopify, WooCommerce, or Magento. For complex flow changes (multi-step checkout, server-side logic) you'll want developer involvement to avoid flicker and to instrument the right events.