Experiment Analysis
The structured work between "test ended" and "decision made" — segment breakdowns, cohort splits, device cuts, and revenue attribution that turn an A/B test result into a confident rollout call.
Experiment analysis is everything that happens after a test reaches its stopping rule and before the team commits to ship, kill, or iterate. It covers statistical interpretation of the headline result, segment and cohort breakdowns to spot where the lift actually came from, device-level cuts to catch mobile-versus-desktop divergence, and revenue impact attribution to translate a conversion-rate delta into euros.
The goal is not to find a reason to ship — it's to understand the experiment well enough that the decision survives contact with reality after rollout. Done well, it's a 60-90 minute structured review. Done badly, it's a screenshot of a green arrow pasted into Slack.
Most experimentation programs are bottlenecked not by test velocity but by decision quality. A test that ends at 95% confidence with a 4.2% lift on conversion rate looks like a clear ship — until you find the entire effect came from one device, one traffic source, or one cohort that was already converting well.
That's the work experiment analysis exists to do. It sits downstream of experimentation and breaks into six tightly linked sub-disciplines: statistical interpretation, segment analysis, cohort analysis, device analysis, revenue impact, and experiment reporting. Each answers a different question about the same dataset.
Phase 1 — Read the headline number honestly
Start with statistical interpretation of the primary metric. Confirm the test hit its pre-registered sample size, that the stopping criterion wasn't moved mid-flight, and that the p-value or Bayesian probability matches what your test plan called significant. Anything else is a peek, not a result.
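The checks above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes a frequentist test plan, a two-sided two-proportion z-test on the primary metric, and a pre-registered sample size per arm. The function names and the `planned_n_per_arm` parameter are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate (B vs A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value


def read_headline(conv_a, n_a, conv_b, n_b, planned_n_per_arm, alpha=0.05):
    """Only call the result significant if the planned sample was reached."""
    p_a, p_b, _, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    reached = min(n_a, n_b) >= planned_n_per_arm
    return {
        "relative_lift": (p_b - p_a) / p_a,
        "p_value": p_value,
        "reached_planned_sample": reached,
        # A low p-value before the planned sample size is a peek, not a result.
        "significant": reached and p_value < alpha,
    }
```

Tying `significant` to `reached_planned_sample` is a deliberate design choice in this sketch: it makes an early peek impossible to report as a win, whatever the p-value says.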
Then check the guardrails. A 5% lift on add-to-cart with a quiet 2% drop in average order value is not automatically a win: unless the extra carts turn into orders, it can be a reshuffle that leaves revenue per session flat or lower. Pull bounce rate, AOV, revenue per session, and any product-margin guardrails you set during test design. If a guardrail moved meaningfully against you, the headline lift is on probation regardless of significance.
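A guardrail check like the one described can be expressed as a small comparison of control and variant metrics. This is a sketch under stated assumptions: the metric names, the tolerance values, and the `check_guardrails` helper are all hypothetical, and it compares point estimates only, without testing whether a guardrail movement is itself statistically distinguishable from noise.

```python
def check_guardrails(control, variant, tolerances):
    """Return the guardrails that moved against the variant beyond tolerance.

    tolerances maps metric name -> (direction, max adverse relative change),
    where direction is "higher_is_better" or "lower_is_better".
    """
    breaches = {}
    for metric, (direction, tolerance) in tolerances.items():
        change = (variant[metric] - control[metric]) / control[metric]
        # Express every movement as "how far it went in the wrong direction".
        adverse = -change if direction == "higher_is_better" else change
        if adverse > tolerance:
            breaches[metric] = change
    return breaches


# Hypothetical readout: AOV fell 2%, bounce rate rose 1%, revenue per session rose 1%.
control = {"aov": 80.0, "bounce_rate": 0.40, "revenue_per_session": 2.00}
variant = {"aov": 78.4, "bounce_rate": 0.404, "revenue_per_session": 2.02}
tolerances = {
    "aov": ("higher_is_better", 0.01),
    "bounce_rate": ("lower_is_better", 0.02),
    "revenue_per_session": ("higher_is_better", 0.01),
}
breaches = check_guardrails(control, variant, tolerances)
```

With these made-up numbers only AOV breaches its tolerance, which is the "headline lift on probation" case from the text.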
Phase 2 — Cut the data until the story changes
This is where segment analysis, cohort analysis, and device analysis do the heavy lifting. The aggregate lift is an average across very different shoppers. A new-PDP test might be flat overall but +9% on mobile and -4% on desktop — two findings hiding behind one number. Always cut by device first; on Shopify stores in the apparel and beauty verticals, mobile is typically 65-80% of sessions and behaves differently from desktop on almost every dimension.
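The device cut can be sketched as a per-segment lift calculation. The row layout, the `lift_by_segment` name, and the example numbers are assumptions for illustration; a real readout would also report sample size and uncertainty per segment, which this sketch leaves out.

```python
def lift_by_segment(rows):
    """Relative conversion-rate lift (variant vs control) for each segment.

    rows: iterable of (segment, arm, sessions, conversions),
    where arm is "control" or "variant".
    """
    totals = {}
    for segment, arm, sessions, conversions in rows:
        seg = totals.setdefault(segment, {"control": [0, 0], "variant": [0, 0]})
        seg[arm][0] += sessions
        seg[arm][1] += conversions

    lifts = {}
    for segment, arms in totals.items():
        cr_control = arms["control"][1] / arms["control"][0]
        cr_variant = arms["variant"][1] / arms["variant"][0]
        lifts[segment] = (cr_variant - cr_control) / cr_control
    return lifts


# Hypothetical data: mobile improves while desktop gets worse.
EXAMPLE_ROWS = [
    ("mobile", "control", 15000, 300),
    ("mobile", "variant", 15000, 327),
    ("desktop", "control", 5000, 200),
    ("desktop", "variant", 5000, 192),
]
example_lifts = lift_by_segment(EXAMPLE_ROWS)
```

In this made-up dataset the mobile lift is about +9% and the desktop lift about -4%, two opposite findings that a single aggregate number would blur together.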
Then cut by traffic source (paid social vs organic vs email), by new vs returning visitor, and by landing-page entry point. Returning customers often dampen treatment effects because they've already learned the old flow. If your lift is concentrated in new visitors, that's a much stronger ship signal for a store running aggressive paid acquisition.
Beware the segment fishing expedition
Slice the data into ten segments and, at a 5% significance level, there is roughly a 40% chance that at least one of them shows a 'significant' result by chance alone (assuming the cuts are independent). Pre-register the three or four cuts you care about during test design (device, new vs returning, top traffic source, top landing page) and treat anything beyond that as exploratory: interesting for the next hypothesis, not evidence for this decision.
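The multiple-comparisons risk can be put into numbers. The sketch below assumes independent segment tests at a fixed significance level, which real segments only approximate. The Bonferroni adjustment is not something the process above prescribes; it is shown as one common, conservative way to set a per-cut threshold for the pre-registered cuts.

```python
def family_wise_error_rate(num_cuts, alpha=0.05):
    """Chance of at least one false positive across independent cuts."""
    return 1 - (1 - alpha) ** num_cuts


def bonferroni_alpha(num_cuts, alpha=0.05):
    """Per-cut significance threshold that keeps the overall rate near alpha."""
    return alpha / num_cuts


ten_cuts = family_wise_error_rate(10)   # roughly 0.40
four_cuts = family_wise_error_rate(4)   # roughly 0.19
per_cut_threshold = bonferroni_alpha(4)  # 0.0125
```

Even four pre-registered cuts carry close to a one-in-five chance of a spurious winner at the usual 5% level, which is why the list should stay short.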
Phase 3 — Convert lift into euros and write it down
Revenue impact is the translation layer between CRO and the rest of the business. Take the conversion-rate delta, apply it to projected annual sessions in the affected funnel, multiply by AOV, and adjust for margin. A 3.1% relative conversion-rate lift on a checkout funnel doing €4M a year in attributed revenue is roughly €124k in incremental revenue before the margin adjustment, assuming AOV holds. That is the kind of number that gets a roadmap slot for the next iteration.
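The arithmetic above can be written out step by step. The sessions, baseline conversion rate, AOV, and 50% gross margin below are hypothetical values chosen so that baseline revenue comes to €4M a year; the sketch also assumes the lift is relative and that AOV is unchanged by the variant.

```python
def project_revenue_impact(annual_sessions, baseline_cr, relative_lift, aov, gross_margin):
    """Translate a relative conversion-rate lift into orders, revenue, and margin."""
    baseline_orders = annual_sessions * baseline_cr
    extra_orders = baseline_orders * relative_lift
    extra_revenue = extra_orders * aov  # assumes AOV is unchanged by the variant
    return {
        "baseline_revenue": baseline_orders * aov,
        "extra_orders": extra_orders,
        "extra_revenue": extra_revenue,
        "extra_margin": extra_revenue * gross_margin,
    }


# Hypothetical funnel: 2M sessions at 2.5% conversion and EUR 80 AOV = EUR 4M a year.
impact = project_revenue_impact(
    annual_sessions=2_000_000,
    baseline_cr=0.025,
    relative_lift=0.031,
    aov=80.0,
    gross_margin=0.50,
)
```

Under these assumptions the 3.1% lift is about 1,550 extra orders and €124k in extra revenue, or about €62k after the assumed 50% margin.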
Then close the loop with experiment reporting. Every test, whether winner, loser, or inconclusive, gets a one-page writeup: hypothesis, primary and guardrail results, the pre-registered segment cuts, the revenue projection, and a ship/kill/iterate call with reasoning. Losers are worth as much as winners; a documented kill prevents the same idea coming back in six months under a different name.
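One way to keep every writeup in the same shape is to store it as a structured record. The class and field names below are hypothetical, a sketch of the sections listed above rather than a required schema.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    KILL = "kill"
    ITERATE = "iterate"


@dataclass
class ExperimentReport:
    hypothesis: str
    primary_result: str
    guardrail_results: dict       # metric name -> observed relative change
    segment_cuts: dict            # cut name -> finding
    revenue_projection_eur: tuple  # (low, high) range
    decision: Decision
    reasoning: str

    def missing_sections(self):
        """Names of sections left empty, so incomplete writeups are easy to spot."""
        empty_values = ("", {}, (), None)
        return [name for name, value in vars(self).items() if value in empty_values]
```

A report whose `reasoning` is an empty string would come back from `missing_sections()` as `["reasoning"]`, a cheap gate before a writeup enters the archive.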
[Figure: How segment cuts can rewrite a headline result]
Experiment analysis FAQ
For a standard A/B test with two variants and a clean primary metric, plan on 60-90 minutes of focused work plus a 30-minute readout. Tests with three or more variants, multiple guardrails, or surprising results legitimately take half a day. Anything faster usually means someone skipped the segment cuts.
Segment analysis splits the test population by an attribute that exists at the moment of the visit — device, traffic source, country, landing page. Cohort analysis groups users by when they entered the test or by behavioural history (new vs returning, prior purchaser, high-value cohort). Both belong in every readout because they answer different questions.
Ship on the strength of a single winning segment only if you pre-registered that segment as a cut you cared about during test design. Otherwise you're almost certainly looking at multiple-comparisons noise. Treat unplanned segment wins as hypotheses for the next test, not as ship signals for this one.
To project revenue impact, use the observed conversion-rate delta on the test sample, then project against a stable forward-looking traffic estimate, typically trailing 90-day sessions in the affected funnel, seasonally adjusted. Multiply by AOV and gross margin. Always report the projection as a range, not a single number, and discount by 20-30% for novelty effects in the first iteration.
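The range-plus-discount step is simple arithmetic. This sketch assumes the point estimate has already been computed and applies only the 20-30% novelty discount from the text; the function name is hypothetical, and a fuller version would also widen the range with the confidence interval on the lift.

```python
def novelty_discounted_range(point_estimate, min_discount=0.20, max_discount=0.30):
    """Return (low, high) after applying the novelty discount band."""
    low = point_estimate * (1 - max_discount)
    high = point_estimate * (1 - min_discount)
    return low, high


# Hypothetical point estimate of EUR 124k incremental revenue.
low, high = novelty_discounted_range(124_000)
```

For a €124k point estimate this gives a reported range of roughly €86.8k to €99.2k.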
Guardrail metrics should cover, at minimum: bounce rate on the affected page, AOV, revenue per session, and one operational metric (return rate, support contact rate, or page-speed proxy if the change touched front-end code). For checkout tests, add payment-method mix and discount-code usage as guardrails.
On most apparel, beauty, and homeware stores in the €1-15M revenue band, mobile is 65-80% of sessions but only 50-65% of revenue because desktop AOV is higher. Lifts on mobile therefore weigh more heavily on transaction count, while desktop lifts weigh more on revenue per session. Report both.
When a test is inconclusive, there are three options: extend it if you're materially under-powered and it has been running for less than your maximum runtime, kill it and document why the hypothesis didn't move the metric, or iterate with a sharper variant if the qualitative evidence (heatmaps, session replays) suggests the change wasn't bold enough. Don't just leave it running.
To check for a novelty effect, compare the lift in week one against weeks two and three. If the effect decays meaningfully (a drop of more than 30%), you're looking at a novelty response from returning visitors. For tests on returning-heavy audiences, weight the later-week data more in your revenue projection, or rerun on new visitors only.
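The week-over-week comparison can be sketched as follows. It assumes you already have one relative lift per week and treats the 30% threshold from the text as a configurable default; the function name is hypothetical, and averaging the later weeks is one reasonable reading of "weeks two and three", not the only one.

```python
from statistics import mean


def novelty_decay(weekly_lifts, threshold=0.30):
    """Compare week-one lift with the average of later weeks.

    Returns (decay, flagged), where decay is the relative drop from week one.
    """
    if len(weekly_lifts) < 2:
        raise ValueError("need at least two weeks of data")
    first_week = weekly_lifts[0]
    if first_week <= 0:
        raise ValueError("decay is only defined for a positive week-one lift")
    later_weeks = mean(weekly_lifts[1:])
    decay = (first_week - later_weeks) / first_week
    return decay, decay > threshold


# Hypothetical weekly lifts: 6.0%, 4.0%, 3.5%.
decay, flagged = novelty_decay([0.06, 0.04, 0.035])
```

With these made-up lifts the effect has dropped by about 37.5% from week one, so the test would be flagged as a likely novelty response.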
Losing tests deserve the same analysis as winners, arguably more. A loss with no breakdown is a wasted test cycle. The segment and device cuts on a flat or negative result often reveal that one cohort responded strongly while another tanked it, which is the seed for the next hypothesis. Document every loser with the same one-page format as winners.
Experiment analysis is the closing phase of every test cycle and the opening input to the next. Statistical interpretation confirms the result is real, segment and cohort breakdowns explain where it came from, device analysis catches platform-specific surprises, revenue impact sizes the prize, and experiment reporting feeds the backlog. Skip any one and the program stops compounding.