Feature Experimentation

Metricuno
May 18, 2026
Quick answer

Feature experimentation merges feature flags with conversion rate optimisation (CRO) methodology, letting you ship product changes behind progressive rollouts and measure their real revenue impact before going to 100%.

Definition

Testing product changes — not just UX tweaks — using feature flags, progressive rollouts, and holdout groups to measure real impact.

Feature experimentation is the practice of shipping new product behaviour behind a feature flag and exposing it to a controlled slice of traffic so you can measure its impact before going to 100%. It borrows the safety mechanisms of modern release engineering — canary releases, progressive rollouts, kill switches — and the measurement discipline of CRO.

Where a classic A/B test compares two button colours, feature experimentation tests whether a new checkout flow, a re-ranked product listing, or a price change actually moves revenue per visitor. The unit of change is bigger, the rollout is gradual, and the same flag that controls exposure also acts as your instant rollback.

Also known as
product experimentation
flag-based testing
controlled rollouts

The shift matters because the highest-leverage changes on a Shopify or WooCommerce store are no longer button colours — they're checkout logic, recommendation algorithms, bundle pricing, shipping-threshold messaging. Those changes touch code, not just CSS, which is why CRO tooling alone can't ship them.

Feature experimentation closes that gap. It sits within the broader practice of experimentation and combines three disciplines: feature flags (the on/off mechanism), progressive rollouts and canary releases (the exposure curve), and product experiments (the measured comparison). Get the three working together and you can ship riskier changes faster, with fewer painful rollbacks.

Phase 1 — Instrument the change behind a flag

Every experiment starts with a feature flag wrapping the new code path. The flag has two jobs: decide which visitors see the new behaviour, and emit an exposure event the moment that decision happens. Without that exposure event, your analytics can't tell who was actually in the test and who just happened to land on the page.

On Shopify this typically means a theme-level snippet plus a server-side check for cart and checkout logic; on WooCommerce a plugin hook does the same. Keep the assignment sticky per visitor (a hashed cookie or customer ID) so the same shopper sees the same variant across sessions — anything else corrupts the data and confuses returning customers.
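
As a rough illustration of sticky assignment plus an exposure event, the sketch below hashes a visitor ID together with a flag key, so the same shopper always lands in the same variant. The names `bucket`, `evaluate_flag`, `flag_exposure`, and `new_checkout_flow` are hypothetical and do not come from any Shopify, WooCommerce, or Metricuno API. The `emit` callback stands in for whatever analytics client you use.

```python
import hashlib
import time


def bucket(visitor_id: str, flag_key: str) -> float:
    """Map a visitor to a stable point in [0, 1) for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{visitor_id}".encode("utf-8")).hexdigest()
    # The first 8 hex characters are 32 bits; dividing by 2**32 scales into [0, 1).
    return int(digest[:8], 16) / 2**32


def evaluate_flag(visitor_id: str, flag_key: str, rollout_fraction: float, emit) -> str:
    """Decide the variant and emit the exposure event at decision time."""
    variant = "treatment" if bucket(visitor_id, flag_key) < rollout_fraction else "control"
    # Hypothetical event shape; adapt the field names to your analytics schema.
    emit({
        "event": "flag_exposure",
        "flag": flag_key,
        "visitor_id": visitor_id,
        "variant": variant,
        "ts": time.time(),
    })
    return variant


events = []
first = evaluate_flag("visitor-123", "new_checkout_flow", 0.5, events.append)
second = evaluate_flag("visitor-123", "new_checkout_flow", 0.5, events.append)
print(first == second, len(events))
```

Because the bucket depends only on the flag key and the visitor ID, no assignment needs to be stored. This sketch emits an event on every evaluation; a production setup would probably deduplicate exposures per visitor.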

Phase 2 — Roll out progressively, not all at once

A progressive rollout means you don't jump from 0% to 50% traffic on day one. A typical ramp on a mid-size store is 1% for 24 hours (the canary release — you're watching error rates, not conversion), then 5%, then 25%, then a stable 50/50 split for the real measurement period.

The early stages catch operational disasters — a bundle test that breaks tax calculation, a recommendation widget that hammers your API. The 50/50 stage is what gives you statistical power for the conversion-rate question. Treat them as separate gates: a clean canary doesn't mean the feature is good, only that it isn't on fire.
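
One way to encode the ramp as separate gates is a small schedule plus a function that decides whether to advance, hold, or kill. This is only a sketch: the traffic fractions follow the ramp described above, but every duration other than the 24-hour canary, the `MAX_ERROR_RATE` threshold, and all names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    fraction: float  # share of traffic that sees the treatment
    min_hours: int   # minimum time at this stage before advancing


# Fractions mirror the ramp in the text; durations after the canary are placeholders.
RAMP = [
    Stage("canary", 0.01, 24),
    Stage("ramp-5", 0.05, 24),
    Stage("ramp-25", 0.25, 24),
    Stage("measure-50-50", 0.50, 24 * 14),
]

MAX_ERROR_RATE = 0.01  # hypothetical operational threshold
KILL = -1              # sentinel meaning "flip the flag off"


def next_stage(current: int, hours_elapsed: float, error_rate: float) -> int:
    """Return the index of the stage to run next, or KILL on an operational failure."""
    if error_rate > MAX_ERROR_RATE:
        return KILL
    if hours_elapsed < RAMP[current].min_hours:
        return current  # not enough time at this stage yet
    return min(current + 1, len(RAMP) - 1)


print(next_stage(0, 25, 0.001))  # clean canary after 25 hours
```

Note that the gate looks only at error rate, not conversion. That matches the point above: a stage that passes is not on fire, which says nothing yet about whether the feature is good.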

Don't conflate rollout and experiment

A 1% canary tells you whether the code works. It tells you almost nothing about whether the feature helps revenue: the sample is too small and too biased toward early-session traffic. Move to a stable 50/50 split for at least one full purchase cycle (often 2-3 weeks for apparel or beauty, 4-6 weeks for considered categories) before you call a winner.

Phase 3 — Measure against a holdout, then decide

The measurement stage is a product experiment in the classical sense: control versus treatment, primary metric defined up front, sample size calculated before launch. The difference is that the same flag you've been using to ramp exposure is now the assignment mechanism — no separate test tool, no second snippet, no risk of double-counting visitors who land in both systems.
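
For the "sample size calculated before launch" step, a common approximation for comparing two conversion rates is the two-proportion normal formula. The sketch below assumes a two-sided test at 5% significance and 80% power. Those defaults, the function name, and the example baseline are illustrative choices, not figures from this article.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in EACH arm to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical store: 2.5% checkout conversion, hoping to detect a 10% relative lift.
print(sample_size_per_arm(0.025, 0.10))
```

With these inputs the result is roughly 64,000 visitors per arm, which is why small relative lifts on low baseline rates take weeks to confirm. Doubling the detectable lift cuts the requirement to roughly a quarter.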

Pick a primary metric tied to revenue (revenue per visitor, AOV, or completed-checkout rate — not raw clicks) and pre-register a guardrail or two: page load time, return rate, customer-service ticket volume. A 3% lift in checkout conversion that comes with a 1.5% jump in refund requests isn't a win, and you won't catch it unless you wrote the guardrail down before launch.
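
Pre-registered guardrails can be written down as data before launch, so the check is mechanical afterwards. The sketch below reads the refund example as a relative increase, which is an assumption. The thresholds, metric names, and figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_relative_increase: float  # tolerated increase versus control


def breached(guardrails, control: dict, treatment: dict) -> list:
    """Return the names of guardrail metrics that worsened beyond tolerance."""
    failures = []
    for g in guardrails:
        change = (treatment[g.metric] - control[g.metric]) / control[g.metric]
        if change > g.max_relative_increase:
            failures.append(g.metric)
    return failures


# Written down before launch (hypothetical thresholds).
GUARDRAILS = [
    Guardrail("refund_rate", 0.01),
    Guardrail("page_load_ms", 0.05),
]

control = {"refund_rate": 0.0400, "page_load_ms": 1800}
treatment = {"refund_rate": 0.0406, "page_load_ms": 1820}  # refunds up 1.5% relative

print(breached(GUARDRAILS, control, treatment))
```

This compares point estimates only. In practice each guardrail deserves its own confidence interval, since a small apparent regression can be noise.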

Chart: Conversion rate by rollout cohort, typical pattern for a checkout-flow feature test. Y-axis: checkout conversion rate (0% to 3.5%). X-axis: rollout stage (1% canary, 5% ramp, 25% ramp, 50/50 hold, week 2 of 50/50). Series: treatment (new flow) and control (existing flow).

The chart shows a pattern worth recognising: early ramp stages look noisy and over-optimistic, then the treatment settles into a smaller but real lift as the sample matures. Calling a winner at the 5% stage would have over-stated the effect by nearly 2x. This is also where novelty effect bites — returning customers reacting to the change itself, not its merit. Watching the second week of stable 50/50 traffic is usually how you tell.
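
To see why early ramp stages look noisy, it can help to compute a confidence interval for the difference in conversion rates at each stage. The sketch below uses a normal approximation. The visitor and conversion counts are invented for illustration and are not read from the chart.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_c: int, n_c: int, conv_t: int, n_t: int,
                             confidence: float = 0.95):
    """Interval for (treatment rate - control rate), normal approximation."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


# Same underlying rates (2.5% vs 2.8%) and same total traffic of 40,000 visitors.
ramp_lo, ramp_hi = lift_confidence_interval(950, 38_000, 56, 2_000)    # 5% ramp
hold_lo, hold_hi = lift_confidence_interval(500, 20_000, 560, 20_000)  # 50/50 split

print(f"5% ramp width:  {ramp_hi - ramp_lo:.4f}")
print(f"50/50 width:    {hold_hi - hold_lo:.4f}")
```

With the same total traffic, the interval at the 5% stage is more than twice as wide as at 50/50, because the tiny treatment arm dominates the uncertainty. An observed lift inside an interval that wide can easily shrink once the sample matures.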

Once you have a result, the flag does its third job: it becomes the rollout switch. A clean win goes to 100% behind the same flag (which you keep in place for a few weeks in case you need to revert). A loss gets flipped off in seconds. A flat result usually means the change wasn't bold enough — file it, learn from it, and pick a bigger hypothesis next time.

Feature experimentation FAQ

How is feature experimentation different from A/B testing?

A/B testing usually refers to UX experiments (copy, layout, button changes) run through a visual editor on the front end. Feature experimentation tests product behaviour that lives in code: pricing logic, checkout steps, recommendation algorithms. The methodology is similar; the unit of change and the tooling are not.

Do I need feature flags to run feature experiments?

Practically, yes. Feature flags are the mechanism that lets you expose a code change to a defined slice of traffic, roll it back instantly, and tie exposure events to your analytics. You can fake it with code branches and deploys, but you lose the kill switch and the per-visitor assignment, which is most of the value.

What is the difference between a canary release and a progressive rollout?

A canary release is the first tiny stage (often 1%) where you're checking that the new code doesn't break anything: errors, latency, payment failures. A progressive rollout is the broader ramp from canary up through 25% and 50% as confidence grows. The canary is the first stage of a progressive rollout.

How long should a feature experiment run?

Long enough to capture at least one full purchase cycle for your category and to hit your pre-calculated sample size. For a typical apparel or beauty store that's 2-3 weeks at a 50/50 split; for considered purchases (electronics, furniture) plan on 4-6 weeks. Stopping early because the curve looks good is one of the most common mistakes teams make.
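
The two conditions (sample size reached, at least one full purchase cycle) can be combined into a single planning estimate. In the sketch below, rounding up to whole cycles is a design assumption, so that every part of the cycle is equally represented. The traffic figures are hypothetical.

```python
import math


def experiment_duration_days(sample_per_arm: int, daily_visitors: int,
                             cycle_days: int, treatment_share: float = 0.5) -> int:
    """Days to run: enough for the sample, rounded up to whole purchase cycles."""
    # The smaller arm fills most slowly, so it sets the pace.
    smaller_arm_share = min(treatment_share, 1 - treatment_share)
    days_for_sample = math.ceil(sample_per_arm / (daily_visitors * smaller_arm_share))
    cycles = max(1, math.ceil(days_for_sample / cycle_days))
    return cycles * cycle_days


# Hypothetical: 64,000 visitors per arm needed, 8,000 eligible visitors a day.
print(experiment_duration_days(64_000, 8_000, cycle_days=7))
print(experiment_duration_days(64_000, 8_000, cycle_days=14))
```

Fixing this number before launch removes the temptation to stop when the curve happens to look good.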

Can I run feature experiments without a developer?

For UX-level changes, yes; most visual editors handle that. For genuine feature experimentation (checkout logic, pricing, recommendations) you'll need either a developer or a platform whose flag plugin wires into Shopify theme and Liquid templates for you. Metricuno's plugin is built to handle the latter case.

Which guardrail metrics should I track?

At minimum: page load time, refund/return rate, and customer-service contact rate. For checkout-flow tests add payment failure rate. The point of guardrails is to catch features that boost the primary metric while quietly damaging something downstream, so they should be defined before launch, not after.

How do UX experiments relate to feature experiments?

UX experiments are a subset focused on presentation-layer changes (what visitors see). Feature experiments include those plus behavioural changes (what the system does). Most mature teams run both through the same flag infrastructure so results are comparable and visitors aren't enrolled in conflicting tests.

What is a holdout group, and do I need one?

A holdout is a slice of traffic (often 5-10%) that never sees any new features, so you can measure cumulative impact over months. It is separate from the control group of any single 50/50 test. It's worth setting up once you're running more than a couple of experiments a month: individual test lifts often don't compound the way you'd expect, and the holdout is how you find out.
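
A long-term holdout can be implemented as a check that runs before any individual flag is evaluated. The sketch below is a minimal version under stated assumptions: the 5% fraction is taken from the low end of the range above, while the salt and all function names are hypothetical.

```python
import hashlib

HOLDOUT_FRACTION = 0.05          # low end of the 5-10% range
HOLDOUT_SALT = "global-holdout"  # hypothetical; keep it fixed for the holdout's lifetime


def _unit_interval(key: str) -> float:
    """Stable hash of a string into [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 2**32


def in_global_holdout(visitor_id: str) -> bool:
    # Salted separately from any flag key, so holdout membership
    # is independent of per-experiment assignment.
    return _unit_interval(f"{HOLDOUT_SALT}:{visitor_id}") < HOLDOUT_FRACTION


def variant_for(visitor_id: str, flag_key: str, rollout_fraction: float) -> str:
    """Holdout visitors never see new features, whatever the flag says."""
    if in_global_holdout(visitor_id):
        return "holdout"
    in_treatment = _unit_interval(f"{flag_key}:{visitor_id}") < rollout_fraction
    return "treatment" if in_treatment else "control"


ids = [f"visitor-{i}" for i in range(10_000)]
share = sum(in_global_holdout(v) for v in ids) / len(ids)
print(f"holdout share: {share:.3f}")
```

Comparing the holdout against everyone else over several months gives the cumulative effect of all shipped features, which individual test results cannot show.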

Should I run separate tests for mobile and desktop?

Run them together but segment the analysis. Lumping them risks missing a strong mobile lift drowned out by a flat desktop result (or vice versa). Splitting the test itself leaves each half with only part of the traffic, which cuts statistical power for no good reason. It is better to power the combined test and slice the data.
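
Slicing a combined test by device can be as simple as grouping exposure records by segment and variant. The sketch below assumes each record is a `(segment, variant, converted)` tuple. That shape and the sample rows are made up for illustration.

```python
from collections import defaultdict


def conversion_by_segment(rows):
    """Return {(segment, variant): conversion rate} from per-visitor records."""
    counts = defaultdict(lambda: [0, 0])  # [visitors, conversions]
    for segment, variant, converted in rows:
        tally = counts[(segment, variant)]
        tally[0] += 1
        tally[1] += int(converted)
    return {key: conv / visitors for key, (visitors, conv) in counts.items()}


# Tiny illustrative dataset: one combined test, analysed per device.
rows = [
    ("mobile", "treatment", True), ("mobile", "treatment", True),
    ("mobile", "treatment", False), ("mobile", "treatment", False),
    ("mobile", "control", True), ("mobile", "control", False),
    ("mobile", "control", False), ("mobile", "control", False),
    ("desktop", "treatment", True), ("desktop", "treatment", False),
    ("desktop", "treatment", False), ("desktop", "treatment", False),
    ("desktop", "control", True), ("desktop", "control", False),
    ("desktop", "control", False), ("desktop", "control", False),
]

rates = conversion_by_segment(rows)
for key in sorted(rates):
    print(key, rates[key])
```

Segment-level results rest on smaller samples than the combined test, so treat them as leads for follow-up experiments unless the segment itself reaches the planned sample size.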

What is the most common mistake in feature experimentation?

Calling winners during the ramp stage instead of at stable 50/50 traffic. Early-ramp numbers are biased toward eager early-session visitors and have wide confidence intervals; the lift you see at 5% rarely survives to 50%. Wait for the stable 50/50 phase before declaring anything.
