How to Use the A/B Testing Process
A practical walkthrough of the eight stages every A/B test moves through — from hypothesis prioritization to post-test documentation — with benchmarks and pitfalls at each step.
A/B Testing Process
The end-to-end operating procedure for running a controlled experiment: prioritize, design, QA, launch, monitor, analyze, decide, and document.
The A/B testing process is the operational checklist an experimentation team follows for every test it ships. It turns a backlog of ideas into a repeatable pipeline with eight stages: prioritize, design, QA, launch, monitor, analyze, decide, and document. Each stage has its own artefacts — a scored hypothesis, a test plan, a QA sign-off, a results readout, a decision log.
Mature CRO teams treat the process itself as the asset. Win rates on individual ideas vary, but a disciplined process compounds: cleaner data, fewer false positives, faster cycle time, and institutional memory that prevents you from re-running the same test in 18 months.
Most teams don't lose at A/B testing because their ideas are bad. They lose because the process around the ideas is loose — tests launch without QA, run too short, get called early, or finish without anyone writing down what was learned.
This guide walks through all eight stages with the artefact each one produces, the typical failure mode, and a realistic time budget. By the end you should be able to audit your own workflow and spot which stage is leaking velocity or rigour.
Stage 1-2: Prioritize and design
Prioritization is where most programs over-think and under-ship. You don't need a perfect score — you need a defensible ranking. ICE, PIE, and PXL frameworks all work; pick one, score every backlog item the same way, and review the top five every two weeks.
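As a rough illustration of scoring every backlog item the same way, here is a minimal ICE sketch. The 1-10 scales, the averaging rule, and the example ideas are assumptions made for illustration; some teams multiply the three scores instead, and PIE or PXL would use different fields.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10: expected effect on the primary metric (assumed scale)
    confidence: int  # 1-10: strength of the evidence behind the idea
    ease: int        # 1-10: how cheap the idea is to build and QA

    @property
    def ice(self) -> float:
        # Averaging is one common convention; multiplying is another.
        return (self.impact + self.confidence + self.ease) / 3


def rank(backlog: list[Idea], top_n: int = 5) -> list[Idea]:
    """Return the top_n ideas by ICE score, highest first."""
    return sorted(backlog, key=lambda idea: idea.ice, reverse=True)[:top_n]


# Hypothetical backlog items, scored with the same rubric.
backlog = [
    Idea("Sticky add-to-cart on mobile", impact=7, confidence=6, ease=8),
    Idea("Rewrite shipping copy in checkout", impact=5, confidence=6, ease=9),
    Idea("New product page layout", impact=9, confidence=4, ease=2),
]

for idea in rank(backlog):
    print(f"{idea.ice:.1f}  {idea.name}")
```

The point of the sketch is consistency rather than precision: the ranking is defensible because every idea went through the same formula.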
Design turns the top-ranked idea into a test plan: a single primary metric, a clear hypothesis ("if we do X, Y will change because Z"), a target audience, a minimum detectable effect, and a calculated sample size. Skip the sample-size calculation and you're guessing how long to run.
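The sample-size calculation can be sketched with the standard normal-approximation formula for comparing two conversion rates. The 5% significance level, 80% power, and the example baseline are assumptions, not values from this guide; a dedicated calculator or your testing tool may use a slightly different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in EACH variant for a two-sided test on conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)         # rate if the MDE is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Assumed example: 3% baseline conversion, 10% relative lift as the MDE.
# This comes out at roughly 53,000 visitors per variant.
print(sample_size_per_variant(baseline_rate=0.03, relative_mde=0.10))
```

Halving the MDE roughly quadruples the required sample, which is why an unrealistically small MDE on a low-traffic page produces tests that can never finish.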
The artefact at this stage is a one-page test brief. Anyone on the team should be able to read it and explain what's being tested, on whom, and how you'll know if it worked. If the brief takes more than 15 minutes to write, the idea is probably underdeveloped.
Common failure: skipping the hypothesis
"Let's test a red button vs. blue" is not a hypothesis — it's a guess. A real hypothesis names the user behavior you expect to shift and why. Without one, a winning variant teaches you nothing transferable, and a losing variant teaches you even less.
Stage 3-4: QA and launch
QA is the stage teams cut corners on, then pay for in invalidated results. Before any test goes live, check it on Chrome, Safari, and mobile Safari, with cookies cleared, on the actual checkout flow — not just the landing page. Verify the analytics event fires once per user, not once per pageview.
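One way to check the "once per user" rule during QA is to export the raw events from a test session and count them per user. The sketch below assumes a hypothetical export of `(user_id, event_name)` pairs; your analytics tool's export format will differ.

```python
from collections import Counter


def users_with_duplicate_events(events, event_name):
    """Return {user_id: count} for users who fired event_name more than once."""
    counts = Counter(user_id for user_id, name in events if name == event_name)
    return {user_id: n for user_id, n in counts.items() if n > 1}


# Hypothetical QA export: u1 fired the conversion event twice.
qa_events = [
    ("u1", "purchase"),
    ("u1", "purchase"),
    ("u2", "purchase"),
    ("u2", "page_view"),
    ("u2", "page_view"),
]

print(users_with_duplicate_events(qa_events, "purchase"))
```

An empty result is the sign-off condition; any user listed here means the event is firing per pageview or per reload and will inflate the metric.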
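Sample ratio mismatch (SRM) is the silent killer here. If your 50/50 split lands at 47/53 after the first 48 hours across several thousand visitors, something is broken: bot traffic hitting one variant, a redirect loop, a cache layer serving stale assets. On a few hundred visitors the same split can be pure chance, so confirm it with a chi-square test on the visitor counts rather than by eye. Pause the test and find the cause before you trust a single number.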
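An SRM check is usually a chi-square goodness-of-fit test on the visitor counts. The following is a minimal standard-library sketch; the alert threshold you compare the p-value against (0.01 or 0.001 are common choices) is an assumption you should set for your own program.

```python
from math import erfc, sqrt


def srm_p_value(visitors_a: int, visitors_b: int,
                expected_share_a: float = 0.5) -> float:
    """p-value of a chi-square test that the observed split matches the plan."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # With one degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no SciPy is needed.
    return erfc(sqrt(chi2 / 2))


# Same 47/53 ratio at two very different volumes.
print(srm_p_value(47, 53))        # small sample: consistent with chance
print(srm_p_value(4700, 5300))    # large sample: almost certainly a bug
```

The two calls show why the ratio alone is not enough: 47/53 on 100 visitors is unremarkable, while the same ratio on 10,000 visitors is far beyond what random assignment produces.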
Figure: Where time is actually spent across an 8-stage A/B test
Launch itself should be boring. If the test brief is solid and QA is clean, flipping the toggle takes 10 minutes. The teams that ship the most tests aren't the ones with the fastest launches — they're the ones whose QA never sends them back to the drawing board.
Stage 5-6: Monitor and analyze
Monitoring is not analyzing. While the test is live you're looking for one thing: signs the test is broken. SRM, flat traffic to one variant, error spikes in checkout, a dramatic drop in revenue per visitor. If any of those appear, you pause. You do not peek at conversion rate and make calls.
Analysis happens once — after the test has reached its pre-declared sample size and run for at least one full business cycle (typically two weeks). Look at the primary metric first. Only then look at secondary metrics, segment cuts, and qualitative signals. Reversing that order is how you talk yourself into shipping noise.
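When the primary metric is a conversion rate, the one-time analysis can be sketched as a two-proportion z-test plus a confidence interval on the difference. This is an illustrative sketch with assumed example counts; it does not apply as written to revenue per visitor, which is not a proportion and needs a different test.

```python
from math import sqrt
from statistics import NormalDist


def compare_conversion(conv_a: int, n_a: int, conv_b: int, n_b: int,
                       alpha: float = 0.05) -> dict:
    """Two-sided z-test and confidence interval for variant B minus control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error: used for the test of "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error: used for the interval around the difference.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    return {"diff": diff, "p_value": p_value, "ci": (diff - margin, diff + margin)}


# Assumed example: 1,500 vs 1,650 orders on 50,000 visitors per variant.
result = compare_conversion(1500, 50_000, 1650, 50_000)
print(result)
```

Running this once, at the pre-declared sample size, is what keeps the p-value meaningful; running it daily and stopping at the first significant reading does not.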
Typical A/B test cycle time and win rates by store size
| Store revenue tier | Median test duration | Tests per quarter | Win rate (primary metric) | Median lift on winners |
|---|---|---|---|---|
| €1M-€3M Shopify store | 18 days | 4-6 | 18-22% | +6.4% |
| €3M-€8M Shopify store | 14 days | 8-12 | 20-25% | +4.8% |
| €8M-€15M multi-store | 11 days | 14-20 | 22-28% | +3.6% |
| Mature program (3+ yrs) | 10 days | 20-30 | 25-32% | +2.9% |
Notice the pattern: more mature programs ship more tests, win more often, but with smaller lifts. That's not regression — it's the natural arc of CRO. The easy wins go first, and the discipline of the process is what keeps the program profitable once the obvious fixes are gone.
Stage 7-8: Decide and document
Decide means a single owner names the outcome: ship, kill, or iterate. "Inconclusive, let's run it longer" is a non-decision and usually means the test should have been killed. Write the decision down with the date, the primary-metric result, and the confidence interval.
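A decision log entry can be as small as one record per test. The sketch below is one possible shape, with hypothetical field names and example values; the only constraint taken from the text is that the outcome must be ship, kill, or iterate, so "inconclusive" is deliberately not a valid value.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from enum import Enum


class Outcome(str, Enum):
    SHIP = "ship"
    KILL = "kill"
    ITERATE = "iterate"


@dataclass(frozen=True)
class Decision:
    test_name: str
    owner: str               # the single person who made the call
    outcome: Outcome
    decided_on: date
    primary_metric: str
    observed_lift: float     # relative lift on the primary metric
    ci_low: float            # confidence interval bounds for that lift
    ci_high: float

    def to_json(self) -> str:
        record = asdict(self)
        record["outcome"] = self.outcome.value
        record["decided_on"] = self.decided_on.isoformat()
        return json.dumps(record)


# Hypothetical entry for illustration only.
decision = Decision(
    test_name="Sticky add-to-cart on mobile",
    owner="Head of e-commerce",
    outcome=Outcome.SHIP,
    decided_on=date(2024, 5, 6),
    primary_metric="revenue_per_visitor",
    observed_lift=0.042,
    ci_low=0.008,
    ci_high=0.076,
)
print(decision.to_json())
```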
Documentation is the stage that separates programs that compound from programs that thrash. A two-paragraph readout — hypothesis, result, what we learned, what we'd test next — saved in a searchable place is enough. Skip this and you'll re-litigate the same checkout copy debate every six months.
What good documentation looks like
Every test gets a stable URL with: brief, screenshots of both variants, primary and secondary metric results, segment cuts, the decision and who made it, and one paragraph on what to test next. New hires should be able to read the last 20 tests in an afternoon and know the state of the program.
Frequently asked questions
How long does one test take from start to finish?
For a mid-sized store, allow roughly 1 day for prioritization and design, half a day for QA, 10-14 days live, and a day for analysis, decision, and documentation. Total elapsed time is 2-3 weeks per test, but most of that is calendar time spent waiting for sample size, not active work.
What is the difference between A/B testing and the A/B testing process?
A/B testing is the statistical method: comparing two variants on a metric. The A/B testing process is the operational workflow around it: how ideas get prioritized, how tests get QA'd, how results get documented. You can do A/B testing without a process, but you'll waste most of what you learn.
Do we need dedicated software to manage the process?
Not initially. A shared spreadsheet with one row per test (hypothesis, status, primary metric, decision, link to readout) works fine up to about 20 tests per quarter. Beyond that, dedicated experimentation platforms or a Notion/Airtable database start paying for themselves.
Who should own each stage?
Common split: a CRO specialist owns prioritization, design, analysis, and documentation. A developer or no-code editor owns implementation and launch. QA is shared. The decision sits with whoever owns the primary metric, usually the head of e-commerce.
What if our tests rarely reach significance?
First check whether the minimum detectable effect was set realistically. If you're trying to detect a 1% lift on a low-volume page, no process will save you. If the math is right, run for one full business cycle minimum (two weeks), and stop at the pre-declared sample size regardless of how the numbers look mid-flight.
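The run-time rule in this answer can be turned into a quick planning estimate. The sketch assumes traffic is split evenly and arrives at a steady daily rate, and the example numbers are illustrative; the 14-day floor is the two-week business cycle mentioned above.

```python
from math import ceil


def planned_duration_days(sample_per_variant: int, daily_visitors: int,
                          variants: int = 2, min_days: int = 14) -> int:
    """Days needed to reach the pre-declared sample, never less than min_days."""
    days_for_sample = ceil(sample_per_variant * variants / daily_visitors)
    return max(min_days, days_for_sample)


# Assumed example: about 53,000 visitors needed per variant.
print(planned_duration_days(53_208, daily_visitors=6_000))    # traffic-limited
print(planned_duration_days(53_208, daily_visitors=20_000))   # floor applies
```

If the estimate comes back at several months, that is the signal that the MDE is unrealistic for the page's traffic, not that the test should simply run longer.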
Which primary metric should we use?
Revenue per visitor (RPV) for anything touching the funnel, because it captures both conversion rate and average order value in one number. Use conversion rate alone only when AOV can't realistically move (e.g. a single-product page). Never optimize for click-through on a button if revenue downstream is what you actually care about.
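The reason RPV is the safer primary metric follows from its decomposition: RPV equals conversion rate multiplied by average order value. The small sketch below uses made-up numbers to show a variant that wins on conversion rate and still loses on RPV.

```python
def funnel_metrics(visitors: int, orders: int, revenue: float) -> dict:
    """Conversion rate, average order value, and revenue per visitor."""
    return {
        "cr": orders / visitors,
        "aov": revenue / orders,
        "rpv": revenue / visitors,   # equals cr * aov
    }


# Made-up numbers: the variant converts more visitors at a lower basket size.
control = funnel_metrics(visitors=10_000, orders=300, revenue=24_000.0)
variant = funnel_metrics(visitors=10_000, orders=330, revenue=23_100.0)

for name, m in (("control", control), ("variant", variant)):
    print(f"{name}: CR {m['cr']:.2%}  AOV {m['aov']:.2f}  RPV {m['rpv']:.2f}")
```

A team judging this test on conversion rate alone would ship the variant and lose revenue.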
How do we stop stakeholders from peeking at results?
Two things: hide the live dashboard from anyone who isn't on the experimentation team, and publish a weekly readout that shows only tests that have hit their sample size. Peeking is a stats problem, but the fix is mostly organizational: remove the temptation.
What is a healthy win rate?
20-25% of tests producing a statistically significant positive result on the primary metric is healthy. Higher than 35% suggests your tests are too conservative or you're cherry-picking metrics post-hoc. Lower than 15% usually means weak hypotheses or insufficient sample size.
Can we run more than one test at the same time?
Yes, on separate pages or audiences. Running two tests on the same page risks interaction effects and complicates analysis. A practical rule: one test per funnel step at a time; multiple tests across the site are fine. Map out the funnel and assign tests to non-overlapping zones.
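The "one test per funnel step" rule is easy to check mechanically before each launch. This sketch assumes a hypothetical list of `(test_name, funnel_step)` pairs maintained by the team; the step names are illustrative.

```python
from collections import defaultdict


def overlapping_tests(running_tests):
    """Return {funnel_step: [test names]} for steps with more than one live test."""
    by_step = defaultdict(list)
    for test_name, funnel_step in running_tests:
        by_step[funnel_step].append(test_name)
    return {step: names for step, names in by_step.items() if len(names) > 1}


# Hypothetical live tests: two of them collide on the checkout step.
running = [
    ("Sticky add-to-cart", "product page"),
    ("Shipping copy", "checkout"),
    ("Trust badges", "checkout"),
]

print(overlapping_tests(running))
```

An empty dictionary means every live test sits in its own zone.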
What happens when a test loses?
The losing variant gets shut off, but the work isn't wasted. Document why you think it lost, what segments behaved differently, and what the result tells you about user behavior. A documented loss often produces the next test's hypothesis, and that is where the compounding comes from.