How to use an A/B Testing Roadmap
An A/B testing roadmap is the artifact that turns scattered test ideas into a sequenced quarter of experiments. Here's how to build one your team will actually follow.
A/B Testing Roadmap
A sequenced plan of experiments — by funnel stage, surface, or hypothesis cluster — covering a quarter or year of CRO work.
An A/B testing roadmap is the planning artifact that turns a backlog of test ideas into an ordered queue with owners, dates, and expected impact. It typically spans a quarter (sometimes a full year), groups experiments by funnel stage or hypothesis theme, and accounts for traffic constraints so two tests don't fight for the same visitors.
The roadmap sits between your hypothesis library and your live experiment calendar. It answers three questions your stakeholders keep asking: what are we testing next, why that and not something else, and when will we know if it worked? Done well, it aligns CRO, product, and marketing on a single source of truth.
Most stores don't fail at A/B testing because they pick bad variants. They fail because every test feels ad hoc: someone saw a LinkedIn post, the CEO has a hunch, the agency wants to justify its retainer. A roadmap forces the question "why this test, this quarter?" before a single pixel moves.
The output is not a Gantt chart. It's a living document — usually a Notion page, Linear project, or shared spreadsheet — that lists upcoming experiments with hypothesis, primary metric, target segment, expected lift, and required traffic. The point is shared visibility, not bureaucracy.
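If the roadmap lives in a spreadsheet or database, the fields listed above map naturally onto a small record type. The sketch below is one possible shape, not a standard: the class name, field names, and sample values are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoadmapEntry:
    """One planned experiment; field names are illustrative, not a standard."""
    hypothesis: str
    primary_metric: str
    target_segment: str
    expected_lift_pct: float   # relative lift, e.g. 8.0 means +8%
    required_sessions: int     # total traffic needed across all variants
    owner: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Return the names of required text fields that are empty."""
        required = {
            "hypothesis": self.hypothesis,
            "primary_metric": self.primary_metric,
            "target_segment": self.target_segment,
        }
        return [name for name, value in required.items() if not value.strip()]


# Sample values are made up for illustration.
entry = RoadmapEntry(
    hypothesis="If we show shipping cost on the PDP, checkout completion "
               "will improve because the cost is no longer a surprise",
    primary_metric="checkout completion rate",
    target_segment="mobile visitors",
    expected_lift_pct=8.0,
    required_sessions=20_000,
)
print(entry.missing_fields())  # an empty list means the entry is complete
```

A check like `missing_fields` keeps incomplete entries visible without adding process: anything it flags simply isn't ready to be sequenced.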
Inputs: what feeds a credible roadmap
A roadmap built on opinions alone reads like a wishlist. A roadmap built on funnel data reads like a plan. You need three inputs before sequencing anything: quantitative drop-off data (GA4 funnel reports), qualitative friction signals (session recordings, heatmaps, support tickets, on-site polls), and business context (next quarter's promo calendar, planned PDP redesigns, inventory shifts).
The quantitative layer tells you where money is leaking. If your Shopify checkout converts at 38% but the industry median for apparel is closer to 50%, that can be a six-figure annual gap, depending on your volume, and an obvious roadmap anchor. The qualitative layer tells you why: maybe shipping cost surprise, maybe a broken Apple Pay button on iOS Safari.
Business context is the input most roadmaps skip. There's no point testing a new PDP layout the week before Black Friday traffic spikes — you'll burn the learnings on atypical visitors. Map your test windows against the marketing calendar before you commit to dates.
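Checking test windows against the marketing calendar is easy to automate. Here is a minimal sketch, assuming the calendar is available as named date ranges; the function names, campaign names, and dates are placeholders rather than anything prescribed by a particular tool.

```python
from datetime import date


def overlaps(start_a: date, end_a: date, start_b: date, end_b: date) -> bool:
    """True when two inclusive date ranges share at least one day."""
    return start_a <= end_b and start_b <= end_a


def blocked_by(test_start: date, test_end: date,
               campaigns: dict[str, tuple[date, date]]) -> list[str]:
    """Names of campaigns whose window overlaps the proposed test window."""
    return [name for name, (start, end) in campaigns.items()
            if overlaps(test_start, test_end, start, end)]


# Hypothetical marketing calendar; the dates are placeholders.
campaigns = {
    "Black Friday": (date(2025, 11, 24), date(2025, 12, 1)),
    "Spring launch": (date(2026, 3, 2), date(2026, 3, 8)),
}

november_test = blocked_by(date(2025, 11, 10), date(2025, 11, 30), campaigns)
january_test = blocked_by(date(2026, 1, 12), date(2026, 2, 1), campaigns)
print(november_test)  # campaigns that collide with the November window
print(january_test)   # campaigns that collide with the January window
```

A non-empty result doesn't have to veto the test; it just forces the conversation about moving the dates before anything is committed.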
Start with a 90-day audit, not a blank page
Pulling historical GA4 data into a structured drop-off audit usually surfaces 15-25 testable friction points within a day. That's the raw material for two quarters of roadmap — far more useful than a brainstorm session where the loudest voice wins.
Prioritisation: which tests go first
ICE (Impact, Confidence, Ease) and PIE (Potential, Importance, Ease) are the two scoring frameworks most teams reach for. They're imperfect but better than gut feel. The trick is being honest about Confidence — if your evidence is "a competitor does it," that's a 2, not an 8.
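To make the scoring concrete, here is a small sketch that ranks a backlog by ICE. It assumes each dimension is scored 1-10 and that the three scores are averaged; averaging is one common convention, and some teams multiply instead. The ideas and scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10

    @property
    def ice(self) -> float:
        """Average of the three scores (an assumed convention)."""
        return (self.impact + self.confidence + self.ease) / 3


# Invented backlog; note the low Confidence on the "competitor does it" idea.
backlog = [
    Idea("Copy competitor's sticky add-to-cart", impact=6, confidence=2, ease=7),
    Idea("Show shipping cost before checkout", impact=7, confidence=8, ease=5),
    Idea("Fix Apple Pay button on iOS Safari", impact=8, confidence=9, ease=6),
]

ranked = sorted(backlog, key=lambda idea: idea.ice, reverse=True)
for idea in ranked:
    print(f"{idea.ice:.1f}  {idea.name}")
```

Scoring Confidence honestly is what pushes the copycat idea to the bottom here, even though it is the easiest of the three to build.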
Beyond scoring, layer in two practical filters. First: traffic feasibility. A test on a page getting 800 sessions a week can take six weeks or more to reach significance on a realistic 8% lift, which is nearly half a quarter spent on one experiment. Second: blast radius. Header and cart tests affect every page; PDP-specific tests don't. Sequence high-blast-radius tests for windows when no concurrent test could be contaminated.
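Traffic feasibility can be estimated before a test ever reaches the roadmap. The sketch below uses the standard normal-approximation sample size formula for comparing two proportions; the 5% significance level, 80% power, even traffic split, and 50% baseline rate are all assumptions, and your testing tool may use a different statistical method.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_variant(baseline_rate: float, relative_lift: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sessions per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def weeks_to_run(baseline_rate: float, relative_lift: float,
                 weekly_sessions: int, variants: int = 2) -> int:
    """Whole weeks needed, assuming traffic is split evenly across variants."""
    needed = sessions_per_variant(baseline_rate, relative_lift) * variants
    return ceil(needed / weekly_sessions)


# Assumed 50% baseline (e.g. a checkout step), 8% relative lift, 800 sessions a week.
print(weeks_to_run(baseline_rate=0.50, relative_lift=0.08, weekly_sessions=800))
```

The estimate is very sensitive to the baseline rate: the same 8% relative lift on a page converting at a few percent needs far more sessions, which is why low-converting, low-traffic surfaces rarely make a good roadmap slot.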
Figure: Typical roadmap allocation by funnel stage (quarterly)
The allocation above is a starting point, not a prescription. PDP and checkout tend to dominate because they sit closest to revenue. But if your funnel data shows traffic-quality issues, push more weight upstream into landing and category pages. The roadmap should reflect where your funnel actually leaks.
Cadence: how many tests per quarter is realistic
Test velocity is one of the most over-promised metrics in CRO. "10 tests a month" sounds great until you account for design, dev, QA, runtime, and analysis. For most stores in the €1M-€15M band, sustainable cadence is somewhere between 2 and 8 concurrent or sequential tests per month, depending on traffic and team.
The number isn't the goal; learning velocity is. Three well-designed tests that produce clear winners and losers teach you more than ten underpowered tests that all come back inconclusive. Build the roadmap so each experiment can reach statistical significance within 2-4 weeks; if it can't, redesign the test or pick a higher-traffic surface.
Sustainable test cadence by monthly sessions
| Monthly sessions | Concurrent tests | Tests per quarter | Typical runtime |
|---|---|---|---|
| 50k - 150k | 1 | 4 - 6 | 3 - 4 weeks |
| 150k - 500k | 2 | 8 - 12 | 2 - 3 weeks |
| 500k - 1.5M | 3 - 4 | 15 - 20 | 10 - 14 days |
| 1.5M+ | 5+ | 25+ | 7 - 10 days |
Concurrent tests assume non-overlapping surfaces — a PDP test and a checkout test can run in parallel without interference; two PDP tests cannot, unless you're sophisticated about mutual exclusion. Most Shopify and WooCommerce stores under 500k monthly sessions should plan sequentially and resist the urge to overlap.
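The table and the non-overlap rule can both be encoded as quick planning checks. This is a sketch under stated assumptions: the band values are copied from the table above, a boundary value such as 150k is assigned to the higher band, and the function names and test labels are hypothetical.

```python
from bisect import bisect_right

# Lower bound of each traffic band (monthly sessions), from the cadence table.
BAND_FLOORS = [50_000, 150_000, 500_000, 1_500_000]
BANDS = [
    {"concurrent": "1", "per_quarter": "4 - 6", "runtime": "3 - 4 weeks"},
    {"concurrent": "2", "per_quarter": "8 - 12", "runtime": "2 - 3 weeks"},
    {"concurrent": "3 - 4", "per_quarter": "15 - 20", "runtime": "10 - 14 days"},
    {"concurrent": "5+", "per_quarter": "25+", "runtime": "7 - 10 days"},
]


def cadence_for(monthly_sessions: int):
    """Return the table row for a traffic level, or None below the first band."""
    index = bisect_right(BAND_FLOORS, monthly_sessions) - 1
    return BANDS[index] if index >= 0 else None


def same_surface_conflicts(running: list[tuple[str, str]]) -> list[str]:
    """Surfaces with more than one test assigned, given (test, surface) pairs."""
    seen, conflicts = set(), []
    for _test, surface in running:
        if surface in seen and surface not in conflicts:
            conflicts.append(surface)
        seen.add(surface)
    return conflicts


print(cadence_for(200_000))
print(same_surface_conflicts([
    ("gallery layout", "PDP"),
    ("express pay order", "checkout"),
    ("sticky add-to-cart", "PDP"),
]))
```

The conflict check deliberately ignores mutual exclusion; if your tooling supports it, two tests on the same surface are a warning to investigate, not necessarily an error.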
Governance: keeping the roadmap alive
A roadmap dies in two ways: nobody updates it after week three, or it becomes so detailed that updating it is a part-time job. The middle path is a weekly 30-minute roadmap review — what shipped, what's running, what's next, what got bumped. Anything that doesn't fit in that window doesn't belong on the roadmap.
Document the post-mortem for every concluded test directly on the roadmap entry: result, lift, p-value, and one-sentence learning. This is what compounds. Six months in, your roadmap becomes a searchable hypothesis library — "we tried free shipping thresholds at €50 in Q1, +3.2% AOV, ship to all" — and new test ideas inherit prior evidence.
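If your testing tool doesn't report the lift and p-value directly, both can be computed from raw counts. The sketch below uses a pooled two-proportion z-test, which suits conversion-rate metrics only; a revenue metric such as AOV needs a different test. The function names and sample counts are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def relative_lift(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Variant conversion rate relative to control, e.g. 0.15 means +15%."""
    return (conv_b / n_b) / (conv_a / n_a) - 1


# Made-up counts: control converts 400 of 10,000, variant 460 of 10,000.
lift = relative_lift(400, 10_000, 460, 10_000)
p_value = two_proportion_p_value(400, 10_000, 460, 10_000)
print(f"lift={lift:+.1%}  p={p_value:.3f}")
```

Recording the raw counts next to the computed values makes it possible to re-analyse an old test later with a different method.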
Don't let HiPPO tests jump the queue
The fastest way to kill a roadmap's credibility is letting the highest-paid person's opinion bypass prioritisation. If a CEO request needs to run, fine — but score it like every other test and show the team what got bumped. Transparency keeps the system honest.
Frequently asked questions
Plan one quarter in detail and the next in outline. Anything beyond six months is speculation: funnel data shifts, learnings from early tests redirect the queue, and seasonality changes what's worth testing. Replan at the end of each quarter.
Losing tests stay on the roadmap as concluded entries with the result documented. Losing tests are evidence. If someone proposes the same idea six months later, you want a paper trail showing it was tried, what the lift was, and on which segment.
The roadmap owner is usually the CRO lead or Head of E-commerce. The owner doesn't generate every hypothesis but does control sequencing, kills duplicate tests, and runs the weekly review. Shared ownership tends to mean no ownership.
Either pause the test during a major campaign (Black Friday, big launch) or accept that the segment's behaviour will be atypical and analyse pre/post separately. The worst option is ignoring the overlap and treating contaminated data as clean.
The backlog is the unfiltered pool of hypotheses. The roadmap is the sequenced subset you've committed to running, with dates and owners. Every roadmap item came from the backlog; not every backlog item makes the roadmap.
AI is good at proposing hypotheses from drop-off data and clustering them by funnel stage. Sequencing — which test runs first, accounting for traffic and business context — still benefits from human judgment. Use AI for the input, not the output.
With low traffic, concentrate on high-traffic surfaces (PDP, checkout) and bigger swings (radical redesigns, not button colour). Low-traffic stores can't afford to test marginal changes because the runtime to detect a 2% lift is prohibitive. Pick tests with 8%+ expected lift.
Every roadmap entry needs a hypothesis: "if we change X, then Y will improve because Z" is the minimum bar. Tests without a hypothesis produce results you can't generalise. The roadmap entry should fail review if the hypothesis is missing or doesn't connect to a metric.
The roadmap is one of three artifacts: the hypothesis backlog (raw ideas), the roadmap (sequenced queue), and the experiment log (concluded results). Together they make the testing programme legible to stakeholders who don't live inside it day-to-day.
Notion, Airtable, Linear, and Trello are the most common tools for hosting the roadmap. The tool matters less than the columns: hypothesis, surface, primary metric, target segment, expected lift, status, owner, dates, result. Pick whatever your team already opens daily.