How to use Experiment Reporting
A practical guide to writing experiment reports that actually drive decisions — what to include, what to cut, and how to present lift, segments, and recommendations without overclaiming.
Experiment Reporting
The practice of communicating A/B test results — lift, segments, evidence, and a recommendation — to the people who decide whether to ship.
Experiment reporting is the last mile of experimentation: translating a finished test into a document or dashboard that stakeholders can read, trust, and act on. A good report shows the variant creative, the headline lift with its confidence interval, the segment cuts that matter, the guardrail metrics, and one clear recommendation — ship, don't ship, or iterate.
It sits downstream of experiment analysis, which produces the numbers; reporting decides what to surface and how to frame it. Done consistently, it builds organisational trust in the testing program. Done badly — cherry-picked segments, missing confidence intervals, buried losses — it teaches stakeholders to discount everything experimentation says.
Most A/B tests don't fail at the statistics step. They fail at the communication step — the report that goes to the merch director, the head of e-commerce, or the agency client. If the recommendation isn't clear, the test gets re-litigated in a meeting two weeks later, or worse, gets shipped on a hunch that contradicts the data.
The job of the report isn't to prove you ran a good experiment. It's to give a non-statistician enough context to make a confident ship/no-ship call in under five minutes. Everything in this guide is in service of that.
Anatomy of a report that gets read
A report stakeholders actually read has six sections, in this order: hypothesis, variant screenshots, headline result, segment cuts, guardrails, recommendation. That sequence mirrors how a busy reader scans — what were we testing, what did it look like, did it work, for whom, did anything break, what now.
The hypothesis section is one sentence: "We believed that adding sticky add-to-cart on mobile PDPs would lift add-to-cart rate by 5%+ because thumb-zone friction is the bottleneck." Belief, expected effect, mechanism. If you can't write this in one line, the test wasn't well-scoped.
Variant screenshots come next — control on the left, variant on the right, mobile and desktop if relevant. Sounds obvious, but half the readouts in circulation skip this and force readers to dig through Figma. The image is the fastest way for a merchandiser to understand what actually shipped to users.
The headline result is one number — not a table
Lead with the primary metric, the relative lift, and the confidence interval, in one sentence. Example: "Add-to-cart rate rose from 8.4% to 9.1%, a relative lift of +8.3% (95% CI: +2.1% to +14.5%)." If a reader takes one thing from the report, this is it. Burying the headline under a six-row metric table is the most common reporting mistake.
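A minimal sketch of how the headline sentence could be produced from raw counts, assuming a binary conversion metric and a normal approximation on the log of the rate ratio. The function names and the visitor counts are hypothetical, chosen only to reproduce the 8.4% and 9.1% rates in the example; an analysis layer may well use a different interval method.

```python
import math
from statistics import NormalDist


def relative_lift_ci(conv_c, n_c, conv_v, n_v, level=0.95):
    """Relative lift of variant over control with an approximate CI.

    Uses the delta-method standard error of log(p_v / p_c); this is one
    common approximation, not the only valid choice.
    """
    p_c = conv_c / n_c
    p_v = conv_v / n_v
    log_ratio = math.log(p_v / p_c)
    se = math.sqrt((1 - p_c) / conv_c + (1 - p_v) / conv_v)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lift = p_v / p_c - 1
    low = math.exp(log_ratio - z * se) - 1
    high = math.exp(log_ratio + z * se) - 1
    return lift, low, high


def headline(metric, conv_c, n_c, conv_v, n_v):
    """One-sentence headline: metric, relative lift, confidence interval."""
    lift, low, high = relative_lift_ci(conv_c, n_c, conv_v, n_v)
    return (
        f"{metric} moved from {conv_c / n_c:.1%} to {conv_v / n_v:.1%}, "
        f"a relative lift of {lift:+.1%} "
        f"(95% CI: {low:+.1%} to {high:+.1%})."
    )


# Hypothetical counts: 10,000 visitors per arm.
print(headline("Add-to-cart rate", 840, 10_000, 910, 10_000))
```

The width of the interval depends on the sample size, so the same two rates can give a clean win at high traffic and an inconclusive result at low traffic. That is the reason the interval belongs in the headline sentence rather than in an appendix.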
Segments and guardrails: where reports get honest
After the headline, pre-declared segment cuts. Mobile vs desktop, new vs returning, top traffic source, and any vertical-specific cuts (subscription vs one-off for a beauty store, men's vs women's for apparel). Pre-declared is the key word — segments you committed to before unblinding, not the seventeen slices you discovered after the fact.
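One way to keep "pre-declared" honest is to record the segment list before unblinding and label everything else automatically. The sketch below assumes segment results arrive as a plain mapping of segment name to relative lift; the function names and the label wording are illustrative, not part of any particular tool.

```python
def split_segments(results, predeclared):
    """Separate pre-declared cuts from slices found after the fact.

    results: dict of segment name -> relative lift (0.11 means +11%).
    predeclared: set of segment names fixed before unblinding.
    """
    declared = {k: v for k, v in results.items() if k in predeclared}
    exploratory = {k: v for k, v in results.items() if k not in predeclared}
    return declared, exploratory


def segment_lines(results, predeclared):
    """Report lines: pre-declared cuts first, exploratory ones labelled."""
    declared, exploratory = split_segments(results, predeclared)
    lines = [f"{name}: {lift:+.1%}" for name, lift in declared.items()]
    lines += [
        f"{name}: {lift:+.1%} (exploratory, hypothesis-generating)"
        for name, lift in exploratory.items()
    ]
    return lines


# Hypothetical segment results for illustration.
results = {"mobile": 0.11, "desktop": 0.0, "paid-social new users": 0.12}
for line in segment_lines(results, predeclared={"mobile", "desktop"}):
    print(line)
```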
Guardrail metrics — revenue per visitor, AOV, return rate, page load time, bounce — go in their own row. A positive add-to-cart lift that tanks RPV by 4% is a losing test, and the report should say so on the first page, not page four.
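A guardrail check can be sketched as a small rule: each guardrail has a direction and a tolerance, and the report flags any metric whose harm exceeds it. The class, field names, and tolerance values below are assumptions for illustration. This simplified version compares point estimates only, whereas a fuller check would also look at each guardrail's confidence interval.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    relative_change: float  # -0.04 means the metric fell by 4%
    higher_is_better: bool  # True for RPV or AOV, False for load time or bounce
    tolerance: float        # largest acceptable harm, as a positive fraction

    def breached(self) -> bool:
        # Harm is a drop for higher-is-better metrics, a rise otherwise.
        harm = -self.relative_change if self.higher_is_better else self.relative_change
        return harm > self.tolerance


def guardrail_summary(guardrails):
    """Single line suitable for the first page of the report."""
    failed = [g.name for g in guardrails if g.breached()]
    if failed:
        return "Guardrails FAILED: " + ", ".join(failed)
    return "Guardrails OK"


# Hypothetical values: RPV down 4% against a 1% tolerance.
checks = [
    Guardrail("RPV", -0.04, higher_is_better=True, tolerance=0.01),
    Guardrail("Page load time", 0.02, higher_is_better=False, tolerance=0.05),
]
print(guardrail_summary(checks))
```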
Stakeholder decision confidence by report format
The jump from a numbers-only readout to a structured report with a recommendation is roughly 30 points of stakeholder confidence. Adding segment cuts and guardrails takes you another ten. The work of formatting compounds — every reader after the first benefits from the structure you put in once.
Writing the recommendation
The recommendation is three words minimum and a paragraph maximum. "Ship to all traffic." "Ship to mobile only — desktop showed no effect." "Don't ship — RPV guardrail failed." "Iterate — directional positive but underpowered, propose follow-up with 2x traffic allocation." Make the call. Hedging ("results are interesting but...") forces every reader to make the call themselves, and they'll each make a different one.
Recommendation strength should match evidence strength. A clean win with tight confidence intervals, healthy sample, and no guardrail damage gets a confident ship. A noisy result with one segment driving everything gets an iterate, not a ship. Calibrating this — and being seen to calibrate it — is how reporting builds trust.
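The calibration described above can be sketched as a small decision rule. The inputs and the ordering of the checks are assumptions made for illustration. A real program would set its own thresholds and would still apply judgement, for example when one segment drives the whole result.

```python
def recommend(ci_low, ci_high, guardrails_ok, adequately_powered):
    """Map evidence strength to a recommendation.

    ci_low, ci_high: bounds of the CI on relative lift for the primary metric.
    guardrails_ok: False if any guardrail metric was damaged.
    adequately_powered: whether the test reached its planned sample size.
    """
    if not guardrails_ok:
        return "Don't ship: guardrail failed."
    if ci_high < 0:
        return "Don't ship: variant underperformed control."
    if ci_low > 0 and adequately_powered:
        return "Ship to all traffic."
    # Interval straddles zero, or the test stopped short of its sample.
    return "Iterate: directional at best, propose a follow-up with more traffic."


print(recommend(0.021, 0.145, guardrails_ok=True, adequately_powered=True))
print(recommend(-0.04, 0.16, guardrails_ok=True, adequately_powered=True))
```

Checking guardrails first mirrors the rule that a positive primary metric does not rescue a test that damaged revenue.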
Typical time-to-decision by report quality (post-readout to ship/kill call)
| Report quality | Median time-to-decision | % requiring follow-up meeting | % later reversed |
|---|---|---|---|
| Stats dump, no recommendation | 9 days | 78% | 22% |
| Structured numbers, no recommendation | 5 days | 55% | 14% |
| Structured + clear recommendation | 1 day | 18% | 6% |
| Structured + recommendation + segments + guardrails | Same day | 9% | 3% |
The reversal column matters most. Decisions made off weak reports get unwound — someone challenges the result in a later meeting, the test gets re-run, the team loses six weeks of velocity. Strong reports compound: each one earns the next test more credibility on day one.
Common reporting failures that erode trust
Cherry-picked segments. "It didn't win overall, but mobile-new-users-from-paid-social showed +12%!" Unless that segment was pre-declared, this is post-hoc storytelling and the stakeholder will eventually notice. Report the overall result, then the pre-declared cuts, then exploratory findings clearly labelled as such — "hypothesis-generating, not conclusive."
Missing confidence intervals. A "+6% lift" with no CI is meaningless — it could be +1% to +11% (probably real) or -4% to +16% (basically noise). Stakeholders who get used to point estimates without ranges will start treating every directional result as a win, and your program credibility collapses the first time one reverses in production.
Standardise the template; vary the contents
The single highest-leverage move in experiment reporting is a fixed template every test uses — same six sections, same metric definitions, same chart formats. Readers learn it once and then read every subsequent report in 90 seconds. Metricuno auto-generates the headline numbers, CIs, and segment cuts directly from the experiment analysis layer, so the only thing you write is the hypothesis line and the recommendation.
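A fixed template can be as simple as a data structure that refuses to render without all six sections and always emits them in the same order. The sketch below is a generic illustration with hypothetical names and placeholder file names. It is not Metricuno's API or output format.

```python
from dataclasses import dataclass

SECTION_ORDER = (
    "Hypothesis",
    "Variant screenshots",
    "Headline result",
    "Segment cuts",
    "Guardrails",
    "Recommendation",
)


@dataclass
class ExperimentReport:
    hypothesis: str
    screenshots: list[str]  # image paths or URLs, control first
    headline: str
    segments: list[str]
    guardrails: list[str]
    recommendation: str

    def render(self) -> str:
        """Plain-text report with the six sections in fixed order."""
        bodies = (
            self.hypothesis,
            "\n".join(self.screenshots),
            self.headline,
            "\n".join(self.segments),
            "\n".join(self.guardrails),
            self.recommendation,
        )
        blocks = [f"{title}\n{body}" for title, body in zip(SECTION_ORDER, bodies)]
        return "\n\n".join(blocks)


# Placeholder contents for illustration only.
report = ExperimentReport(
    hypothesis="Sticky add-to-cart on mobile PDPs lifts add-to-cart rate by 5%+.",
    screenshots=["control.png", "variant.png"],
    headline="Add-to-cart rate rose from 8.4% to 9.1%.",
    segments=["mobile: +11.0%", "desktop: +0.0%"],
    guardrails=["Guardrails OK"],
    recommendation="Ship to mobile only.",
)
print(report.render())
```

Because every field is required, a report with no recommendation cannot be constructed at all, which enforces the template rather than leaving it to habit.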
Experiment reporting FAQ
Experiment analysis produces the numbers — lift, p-values, confidence intervals, segment splits. Reporting decides which of those numbers to surface, in what order, with what framing, and what recommendation to attach. Analysis is statistical; reporting is communicative.
Keep the report to one page, or one Notion doc that fits in less than a screen of scrolling. If a stakeholder has to scroll three times to find the headline result, the report is too long. Appendices for full segment data and methodology are fine, provided they sit below the recommendation, not above it.
Always report inconclusive tests. Hiding inconclusive results trains the team to expect every test to win and silently inflates the program's apparent hit rate. Report the test, note that it was underpowered or null, and recommend one of three actions: kill, iterate with more traffic, or shelve.
When a test comes back flat, frame the learning, not the metric. "Variant matched control on RPV, so sticky add-to-cart isn't the lever we thought it was on this PDP. Next hypothesis: above-fold price reframing." A flat test that produces a sharper next hypothesis is a productive test.
When the result is mixed, pick the cleanest action available. "Ship to mobile, hold desktop" if segments diverge sharply. "Iterate with a bolder treatment" if directional but underpowered. "Don't ship" if guardrails failed. Never write 'further investigation needed' as a final recommendation: that's an abdication.
Write for the most senior non-technical stakeholder who'll read it — usually the Head of E-commerce or the brand owner. They need the result, the visual, the recommendation, and enough segment context to ask one smart follow-up. Engineers and analysts can read the same doc; the reverse isn't true.
Include screenshots of the live variant even when a design mock exists. The mock and the shipped variant often differ in subtle ways: copy tweaks, image swaps, mobile rendering. The screenshot is the artefact of what users actually saw, and it's what a future reader (or a new hire reading the test archive) will rely on.
When one segment drives the overall result, say so explicitly in the headline. "Overall lift +4.2% is driven entirely by returning mobile users (+11%); new users were flat." Then recommend shipping to the winning segment only, or iterating to find a treatment that works for the flat segment. Burying segment heterogeneity is one of the fastest ways to ship a result that doesn't replicate.
Publish a full readout per test, plus a weekly or bi-weekly digest summarising what shipped, what didn't, and what's running. The per-test report is the source of truth; the digest is the executive scan. Both use the same template.
For agency clients, use the same template, plus a 'business impact' line that translates the lift into projected annualised revenue at current traffic. Clients pay for outcomes, not p-values, so show the statistical confidence and the euro impact side by side, and lead the conversation with the recommendation.
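The 'business impact' line is simple arithmetic, sketched below under stated assumptions: the lift is measured on revenue per visitor, traffic stays at its current monthly level for a full year, and the effect persists unchanged. The function name and all input figures are hypothetical.

```python
def annualised_impact(monthly_visitors, revenue_per_visitor, lift, ci_low, ci_high):
    """Project a relative RPV lift onto a year of revenue at current traffic.

    Returns the point estimate and the range implied by the CI, in the same
    currency as revenue_per_visitor.
    """
    baseline_annual = monthly_visitors * 12 * revenue_per_visitor
    return {
        "point": baseline_annual * lift,
        "low": baseline_annual * ci_low,
        "high": baseline_annual * ci_high,
    }


# Hypothetical store: 100,000 visitors a month at 2.00 EUR per visitor.
impact = annualised_impact(100_000, 2.00, lift=0.05, ci_low=0.01, ci_high=0.09)
print(
    f"Projected annualised impact: EUR {impact['point']:,.0f} "
    f"(range EUR {impact['low']:,.0f} to EUR {impact['high']:,.0f})"
)
```

Showing the range next to the point estimate keeps the revenue figure as honest as the lift it is derived from.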