RICE Scoring

Metricuno
May 18, 2026
Quick answer

RICE scoring prioritizes experiments by multiplying Reach, Impact, and Confidence, then dividing by Effort. It favors sitewide tests over niche-segment ones — here's how to use it.

Definition

A prioritization formula that ranks experiment ideas by Reach × Impact × Confidence, divided by Effort.

RICE scoring is a quantitative framework for ranking which experiments to run next. Each idea gets four numbers — Reach (how many users will see it in a given period), Impact (how strongly it moves the target metric per user), Confidence (how sure you are the impact will materialize), and Effort (person-weeks to build and run it). The formula multiplies the first three and divides by the last, producing a single score you can sort a backlog by.

Developed at Intercom, RICE differs from the other lightweight prioritization frameworks in one respect: it explicitly weights ideas by audience size. That makes it a natural fit for CRO teams choosing between sitewide changes and niche-segment tweaks.

Also known as
RICE framework
RICE prioritization

RICE sits inside the broader practice of experiment prioritization, alongside lighter frameworks like ICE and PIE. What sets it apart is the Reach term: a homepage hero test that 500,000 visitors will see can outscore a checkout micro-copy tweak that only 8,000 logged-in shoppers will reach, even if the checkout idea has higher per-user impact.

On a Shopify or WooCommerce store, RICE is most useful when your backlog mixes funnel stages — a PDP layout test, a free-shipping threshold, a mobile menu redesign. The Reach term forces you to be honest about how much traffic each surface actually gets before you spend two sprints building the variant.

Formula

RICE = (Reach × Impact × Confidence) / Effort

Variables

Reach

Number of users who will encounter the test in a fixed period (usually one quarter).

Impact

Expected lift per user on the target metric. Scored 3 (massive), 2 (high), 1 (medium), 0.5 (low), 0.25 (minimal).

Confidence

How sure you are the impact will hold, expressed as a percentage (e.g. 80% = 0.8). 100% = strong evidence, 50% = a hunch.

Effort

Person-weeks needed to design, build, QA, and run the test.
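The formula and the four variables above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Idea` class, its field names, and the validation rules are assumptions made for the example, chosen to mirror the scales described in this section.

```python
from dataclasses import dataclass

# Impact tiers as listed in the Variables section.
IMPACT_TIERS = (0.25, 0.5, 1, 2, 3)


@dataclass(frozen=True)
class Idea:
    """One backlog item. Field names are illustrative, not a standard schema."""
    name: str
    reach: float       # users exposed in the scoring period (e.g. one quarter)
    impact: float      # one of IMPACT_TIERS
    confidence: float  # fraction between 0 and 1 (80% -> 0.8)
    effort: float      # person-weeks, must be positive


def rice_score(idea: Idea) -> float:
    """Return (Reach x Impact x Confidence) / Effort for one idea."""
    if idea.reach < 0:
        raise ValueError("reach must not be negative")
    if idea.impact not in IMPACT_TIERS:
        raise ValueError(f"impact must be one of {IMPACT_TIERS}")
    if not 0 < idea.confidence <= 1:
        raise ValueError("confidence must be in (0, 1]")
    if idea.effort <= 0:
        raise ValueError("effort must be positive")
    return idea.reach * idea.impact * idea.confidence / idea.effort


sticky = Idea("Sticky mobile add-to-cart", reach=420_000, impact=1,
              confidence=0.8, effort=2)
print(round(rice_score(sticky)))  # 168000
```

Rejecting Impact values outside the fixed tiers is a design choice for this sketch; a team that scores on a different scale would relax that check.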

Worked example

An apparel store is deciding between two test ideas: (A) a sticky 'Add to cart' button on mobile PDPs, (B) a new size-guide modal for the denim category.

| Idea | Reach (quarterly PDP views) | Impact | Confidence | Effort (person-weeks) |
| --- | --- | --- | --- | --- |
| A (sticky mobile 'Add to cart') | 420,000 (mobile PDPs) | 1 | 0.8 | 2 |
| B (denim size-guide modal) | 55,000 (denim PDPs) | 2 | 0.7 | 3 |

A scores 168,000. B scores 25,667.
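For readers who want to check the arithmetic, substituting the inputs for each idea into the formula gives the following (the subscripts A and B are just labels for the two ideas):

```latex
\begin{align*}
\text{RICE}_A &= \frac{420{,}000 \times 1 \times 0.8}{2} = \frac{336{,}000}{2} = 168{,}000 \\
\text{RICE}_B &= \frac{55{,}000 \times 2 \times 0.7}{3} = \frac{77{,}000}{3} \approx 25{,}667
\end{align*}
```

A's score is roughly 6.5 times B's, while A's Reach is roughly 7.6 times B's. B's doubled Impact claws back part of the gap, and its lower Confidence and higher Effort give some of that back.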

The sticky CTA wins by a wide margin, driven almost entirely by Reach. Even though the size guide has double the per-user Impact, it only touches a denim sub-segment, and that is exactly the audience-size weighting RICE is designed to apply.

Absolute RICE scores are not comparable across teams — they depend on how you define Reach (visitors? sessions? exposed users?) and what time window you use. What matters is the relative ranking inside one backlog, scored consistently by the same person or rubric.

Benchmark

Example RICE scores across a typical CRO backlog

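| Test idea | Reach (qtr) | Impact | Confidence | Effort (wks) | RICE score |
| --- | --- | --- | --- | --- | --- |
| Sticky mobile add-to-cart | 420,000 | 1.0 | 80% | 2 | 168,000 |
| Free-shipping threshold banner | 380,000 | 1.0 | 70% | 1 | 266,000 |
| Homepage hero rotation | 500,000 | 0.5 | 60% | 2 | 75,000 |
| Denim size-guide modal | 55,000 | 2.0 | 70% | 3 | 25,667 |
| Post-purchase upsell flow | 32,000 | 2.0 | 50% | 4 | 8,000 |
| Checkout trust badges | 180,000 | 0.5 | 50% | 1 | 45,000 |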
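The table lists ideas in no particular order, so the last step is to sort by score. A rough sketch of that step follows, using the six rows above as input; the helper names `rice` and `rank` are made up for the example.

```python
# (name, reach per quarter, impact, confidence, effort in person-weeks),
# copied from the benchmark table.
BACKLOG = [
    ("Sticky mobile add-to-cart", 420_000, 1.0, 0.8, 2),
    ("Free-shipping threshold banner", 380_000, 1.0, 0.7, 1),
    ("Homepage hero rotation", 500_000, 0.5, 0.6, 2),
    ("Denim size-guide modal", 55_000, 2.0, 0.7, 3),
    ("Post-purchase upsell flow", 32_000, 2.0, 0.5, 4),
    ("Checkout trust badges", 180_000, 0.5, 0.5, 1),
]


def rice(reach, impact, confidence, effort):
    """(Reach x Impact x Confidence) / Effort."""
    return reach * impact * confidence / effort


def rank(backlog):
    """Return (name, rounded score) pairs, highest score first."""
    scored = [(name, round(rice(*fields))) for name, *fields in backlog]
    return sorted(scored, key=lambda row: row[1], reverse=True)


for name, score in rank(BACKLOG):
    print(f"{score:>8,}  {name}")
```

Sorted this way, the free-shipping threshold banner (266,000) ranks first, ahead of the sticky add-to-cart (168,000), because it needs only one person-week. The rest follow in the order hero rotation, trust badges, size-guide modal, upsell flow.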

Watch for two failure modes. The first is Confidence inflation: teams quickly drift toward scoring everything 80-90%, at which point the term no longer separates one idea from another. Anchor it with evidence tiers: 100% = prior winning test, 80% = strong analytics signal, 50% = informed guess. The second is Effort optimism: estimates that cover only the build. Pad them for QA, analytics setup, and the inevitable rebuild.
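One way to keep Confidence anchored is to let raters pick an evidence tier rather than type a number. The sketch below assumes that approach; the `Evidence` enum, its member names, and the `spread` helper are illustrative, and the `inflated` scores are invented to show the effect.

```python
from enum import Enum


class Evidence(Enum):
    """Evidence tiers from the text; member names are illustrative."""
    PRIOR_WINNING_TEST = 1.0
    STRONG_ANALYTICS_SIGNAL = 0.8
    INFORMED_GUESS = 0.5


def confidence_from(evidence):
    """Map an evidence tier to its Confidence value."""
    return evidence.value


def spread(confidences):
    """Ratio of highest to lowest Confidence in a backlog.

    A value close to 1.0 means the term barely changes the ranking.
    """
    return max(confidences) / min(confidences)


tiered = [confidence_from(tier) for tier in Evidence]
inflated = [0.8, 0.85, 0.9]  # hypothetical scores from an inflated backlog

print(spread(tiered))               # 2.0
print(round(spread(inflated), 3))   # 1.125
```

With the tiers, Confidence can move a score by a factor of two. With everything scored between 80% and 90%, it moves a score by at most about 12%, so the ranking is effectively decided by the other three terms.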

Frequently asked questions

How is RICE different from ICE?

ICE uses Impact, Confidence, and Ease: three scores, usually 1-10, multiplied together. RICE adds Reach as a fourth term and replaces Ease with Effort (in person-weeks), which divides the score instead of multiplying it. The Reach term is the substantive difference: it pulls scores toward high-traffic surfaces, which ICE doesn't do.
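The difference is easiest to see with two ideas that ICE cannot tell apart. In the sketch below, the ICE inputs are invented for illustration, and the Reach figures reuse the 500,000-visitor homepage and 8,000-shopper checkout audiences mentioned earlier in this article.

```python
def ice(impact, confidence, ease):
    """ICE: three 1-10 scores multiplied together."""
    return impact * confidence * ease


def rice(reach, impact, confidence, effort):
    """RICE: (Reach x Impact x Confidence) / Effort."""
    return reach * impact * confidence / effort


# Hypothetical scores: both ideas look identical under ICE.
homepage_ice = ice(impact=6, confidence=7, ease=5)
checkout_ice = ice(impact=6, confidence=7, ease=5)

# Under RICE, the same per-user judgment is scaled by audience size.
homepage_rice = rice(reach=500_000, impact=1, confidence=0.7, effort=2)
checkout_rice = rice(reach=8_000, impact=1, confidence=0.7, effort=2)

print(homepage_ice, checkout_ice)                      # 210 210
print(round(homepage_rice), round(checkout_rice))      # 175000 2800
```

With Impact, Confidence, and Effort held equal, the RICE scores differ by exactly the ratio of the two audiences (62.5 to 1).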

How should I score Impact?

Use a fixed scale: 3 = massive impact, 2 = high, 1 = medium, 0.5 = low, 0.25 = minimal. Resist inventing values between tiers, because the discrete scale is what keeps scoring consistent across raters. Anchor each tier to a concrete lift range (e.g. 1 = expected +2-5% on the primary metric).

What time window should I use for Reach?

A quarter is the standard window, because it matches typical experiment cycles and roadmap planning. The exact window matters less than consistency: if you score one idea on quarterly Reach, score all of them that way.

Does RICE work for personalization tests?

It works, but you have to be honest about Reach. A personalized variant shown only to a 12% audience segment has 12% of the sitewide Reach, which usually drops it below sitewide tests. That's a feature, not a bug: personalization should clear a higher Impact bar.
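The size of that higher bar follows directly from the formula. A small sketch, in which the 500,000 sitewide Reach is a hypothetical figure and both function names are made up for the example:

```python
def segment_reach(sitewide_reach, segment_share):
    """Reach of a variant shown only to a share of the sitewide audience."""
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return sitewide_reach * segment_share


def breakeven_multiplier(segment_share):
    """How many times larger Impact x Confidence / Effort must be for a
    segment idea to tie a sitewide idea on the same surface."""
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    return 1 / segment_share


# Hypothetical store with 500,000 sitewide visitors per quarter.
print(round(segment_reach(500_000, 0.12)))        # 60000
print(round(breakeven_multiplier(0.12), 2))       # 8.33
```

A 12% segment therefore needs the rest of its score (Impact times Confidence, divided by Effort) to be a little over eight times that of the sitewide idea it competes with. Impact alone tops out at 3, so the remainder has to come from higher Confidence or lower Effort.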

Who should estimate Effort?

Engineers. Effort is a build estimate, not a wish. Have the developer who'll do the work give a person-week number that includes design, QA, instrumentation, and analysis time, not just the build.

Is RICE a complete prioritization process on its own?

No. RICE ranks a backlog; it doesn't generate ideas, validate hypotheses, or check statistical feasibility. Use it as the scoring step inside a wider prioritization workflow that also covers hypothesis quality and minimum detectable effect.

Why do sitewide tests almost always outrank segment tests?

Because Reach is a linear multiplier. If you want segment tests to compete, either separate the backlog into 'sitewide' and 'segment' lanes with their own RICE rankings, or cap Reach at a ceiling (e.g. 200k) so it stops dominating.
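Both options are simple to express in code. The sketch below is one possible shape, not a prescribed method: the lane labels assigned to each idea are assumptions made for the example, and the inputs are taken from the benchmark table.

```python
from collections import defaultdict

REACH_CAP = 200_000  # the example ceiling from the text; tune per store


def rice(reach, impact, confidence, effort, reach_cap=None):
    """RICE score, optionally clamping Reach at reach_cap."""
    if reach_cap is not None:
        reach = min(reach, reach_cap)
    return reach * impact * confidence / effort


def rank_by_lane(ideas):
    """Group ideas by lane, then sort each lane by score, highest first."""
    lanes = defaultdict(list)
    for idea in ideas:
        score = rice(idea["reach"], idea["impact"],
                     idea["confidence"], idea["effort"])
        lanes[idea["lane"]].append((idea["name"], round(score)))
    return {lane: sorted(rows, key=lambda row: row[1], reverse=True)
            for lane, rows in lanes.items()}


IDEAS = [
    # Lane labels are assumptions added for this sketch.
    {"name": "Sticky mobile add-to-cart", "lane": "sitewide",
     "reach": 420_000, "impact": 1, "confidence": 0.8, "effort": 2},
    {"name": "Homepage hero rotation", "lane": "sitewide",
     "reach": 500_000, "impact": 0.5, "confidence": 0.6, "effort": 2},
    {"name": "Denim size-guide modal", "lane": "segment",
     "reach": 55_000, "impact": 2, "confidence": 0.7, "effort": 3},
    {"name": "Post-purchase upsell flow", "lane": "segment",
     "reach": 32_000, "impact": 2, "confidence": 0.5, "effort": 4},
]

# Option 1: separate rankings per lane.
print(rank_by_lane(IDEAS))

# Option 2: one ranking with Reach capped.
print(round(rice(420_000, 1, 0.8, 2, reach_cap=REACH_CAP)))  # 80000
print(round(rice(55_000, 2, 0.7, 3, reach_cap=REACH_CAP)))   # 25667
```

With the cap, the sticky add-to-cart drops from 168,000 to 80,000 while the size-guide modal is unchanged, so the gap narrows from about 6.5x to about 3.1x. Lanes avoid the comparison entirely, at the cost of deciding separately how much capacity each lane gets.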

How accurately do RICE scores predict real results?

Not very, and that's fine. RICE is a decision aid, not a forecast. Studies of internal prioritization frameworks show only weak correlation between predicted and actual lift; the value is in forcing a consistent conversation, not in the number itself.

Does Confidence mean statistical confidence or business confidence?

Business confidence in the hypothesis: how likely you think the change is to produce the expected Impact. Statistical power is handled separately at the test-design stage, when you calculate sample size and minimum detectable effect.

How often should I re-score the backlog?

Re-rank when material new data lands: a related test result, a traffic shift, a roadmap change. Most teams do a light rescoring every two to four weeks and a full rebuild quarterly. Daily fiddling with scores wastes time and erodes trust in the framework.
