Bayesian Testing
Bayesian testing reports the probability that variant B beats A and updates as data arrives — letting you stop early without inflating false positives the way frequentist peeking does.
An A/B testing framework that reports the posterior probability a variant beats control, updated continuously as new data arrives.
Bayesian testing is an approach to experiment analysis that combines a prior belief about conversion rates with observed data to produce a posterior probability — most often phrased as 'P(B > A)' or 'probability to be best'. Instead of asking 'how surprising would this result be if there were no real effect?' (the frequentist p-value question), it answers the question the reader actually wants: 'given what I've seen, how likely is variant B to win?'
Because every update is a fresh, self-contained probability statement, Bayesian methods don't carry the peeking penalty that inflates frequentist false-positive rates. The trade-off: you must pick a prior, and you interpret expected loss and credible intervals instead of significance thresholds.
The frequentist alternative (t-tests, z-tests, p-values) assumes a fixed sample size, and its error guarantees hold only if you analyse the data once, at that sample size. Peek ten times at evenly spaced intervals, stopping at the first significant result, and the real false-positive rate rises to roughly 19%, well past the nominal 5%. Bayesian testing sidesteps this because the posterior is just a current belief, not a repeated-trials statement.
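The peeking effect described above can be illustrated with a small simulation. This is a minimal sketch, not a full sequential-testing analysis: it assumes an A/A test (no real effect), equal-sized batches of traffic between looks, and a test statistic that is approximately normal, so the cumulative z-score after `k` looks is the sum of `k` independent standard-normal increments divided by `sqrt(k)`. The function name and defaults are illustrative.

```python
import math
import random


def false_positive_rate(looks: int, z_crit: float = 1.96,
                        n_sims: int = 20_000, seed: int = 0) -> float:
    """Share of simulated A/A tests that cross |z| > z_crit at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for k in range(1, looks + 1):
            # Under the null, each new batch adds an independent N(0, 1) increment.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(k)  # z-statistic on all data seen so far
            if abs(z) > z_crit:
                hits += 1  # the experimenter stops and declares a winner
                break
    return hits / n_sims


single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"1 look: {single_look:.3f}   10 looks: {ten_looks:.3f}")
```

With one look the rate should sit near the nominal 5%; with ten looks it typically lands several times higher, which is the inflation the text refers to.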
In practice this matters for online stores running short cycles. You can ship a variant at 95% probability-to-be-best after one weekend of traffic without violating the math, where a frequentist test would still demand the pre-registered sample size. The cost is conceptual overhead: stakeholders need to understand that 'P(B > A) = 92%' is a direct probability statement about the variants, not a 92% confidence level and not one minus a p-value.
P(θ_A, θ_B | data) ∝ P(data | θ_A, θ_B) × P(θ_A) × P(θ_B)
P(θ_B > θ_A | data) = ∬ 1[θ_B > θ_A] × P(θ_A, θ_B | data) dθ_A dθ_B
| Symbol | Name | Meaning |
|---|---|---|
| θ_A | Control conversion rate | The unknown true conversion rate for variant A, modelled as a probability distribution. |
| θ_B | Variant conversion rate | The unknown true conversion rate for variant B, modelled as a probability distribution. |
| P(θ) | Prior | Your belief about conversion rates before seeing data, typically a Beta(α, β) distribution for binary conversion events. |
| P(data \| θ) | Likelihood | How probable the observed conversions are given a hypothesised conversion rate. |
| P(θ \| data) | Posterior | Updated belief about the conversion rate after combining prior and data; the basis for P(B > A). |
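For binary conversion events with a Beta prior, the posterior has a closed form: add the conversions to α and the non-converting sessions to β. The sketch below shows that update. The `Beta` type, the `update` helper, and the counts (30 conversions in 1,000 sessions) are illustrative and not taken from any particular tool.

```python
from typing import NamedTuple


class Beta(NamedTuple):
    """Parameters of a Beta(alpha, beta) distribution."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def update(prior: Beta, conversions: int, sessions: int) -> Beta:
    """Beta-Binomial conjugate update: successes go to alpha, failures to beta."""
    if not 0 <= conversions <= sessions:
        raise ValueError("conversions must be between 0 and sessions")
    return Beta(prior.alpha + conversions, prior.beta + (sessions - conversions))


prior = Beta(1, 1)  # flat prior: every conversion rate equally plausible
posterior = update(prior, conversions=30, sessions=1000)  # hypothetical counts
print(posterior)  # Beta(alpha=31, beta=971)
print(f"posterior mean: {posterior.mean:.4f}")
```

Because the update is just addition, processing the data in one batch or in daily increments gives the same posterior.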
A Shopify apparel store tests a new product-page hero against control. After 7 days: control 8,400 sessions / 252 conversions (3.00%), variant 8,200 sessions / 287 conversions (3.50%). Using a weak Beta(1, 1) prior, the posterior for control is Beta(253, 8149) and for variant is Beta(288, 7914).
Control conversions / sessions: 252 / 8,400
Variant conversions / sessions: 287 / 8,200
Prior: Beta(1, 1) — uninformative
→ P(B > A) ≈ 96.5%, expected uplift ≈ +16.7%, expected loss of choosing B ≈ 0.004 percentage points
There's a roughly 96.5% probability that variant B has a higher true conversion rate than control, and the expected cost of being wrong is small. That is a reasonable basis to ship, even though a two-sided frequentist z-test on the same data gives p ≈ 0.07 and would not clear α = 0.05.
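The quantities in this example can be estimated by sampling from the two posteriors quoted above, Beta(253, 8149) for control and Beta(288, 7914) for the variant. The following is a minimal Monte Carlo sketch using only the standard library. The function name, draw count, and seed are arbitrary choices, and the estimates will vary slightly with them. Expected loss is reported in absolute conversion-rate units (0.0001 equals 0.01 percentage points).

```python
import random


def compare_posteriors(post_a, post_b, draws=100_000, seed=42):
    """Monte Carlo estimates of P(B > A) and the expected loss of each choice.

    post_a and post_b are (alpha, beta) tuples for the two Beta posteriors.
    """
    rng = random.Random(seed)
    b_wins = 0
    loss_choose_b = 0.0  # accumulates max(theta_a - theta_b, 0)
    loss_choose_a = 0.0  # accumulates max(theta_b - theta_a, 0)
    for _ in range(draws):
        theta_a = rng.betavariate(*post_a)
        theta_b = rng.betavariate(*post_b)
        if theta_b > theta_a:
            b_wins += 1
            loss_choose_a += theta_b - theta_a
        else:
            loss_choose_b += theta_a - theta_b
    return {
        "prob_b_beats_a": b_wins / draws,
        "expected_loss_b": loss_choose_b / draws,
        "expected_loss_a": loss_choose_a / draws,
    }


result = compare_posteriors((253, 8149), (288, 7914))
for name, value in result.items():
    print(f"{name}: {value:.6f}")
```

Sampling is used here because it extends unchanged to other metrics and to more than two arms. Closed-form and numerical-integration alternatives exist for the two-arm Beta case.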
Pair P(B > A) with expected loss to decide. Probability-to-be-best tells you which variant is likely better; expected loss tells you how bad it would be to pick the loser. A common stopping rule: ship when P(B > A) > 95% AND expected loss of the chosen variant < 0.1% of baseline conversion rate.
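The stopping rule above can be written as a small predicate. This sketch assumes expected loss is measured in absolute conversion-rate units and converts the "0.1% of baseline" tolerance into the same units before comparing. The function name, parameter names, and the example inputs are hypothetical.

```python
def should_ship(prob_b_beats_a: float, expected_loss_b: float,
                baseline_rate: float, prob_threshold: float = 0.95,
                relative_loss_tolerance: float = 0.001) -> bool:
    """Ship B only if both the probability and the expected-loss checks pass."""
    # 0.1% of a 3% baseline is 0.00003 in absolute conversion-rate units.
    loss_cap = relative_loss_tolerance * baseline_rate
    return prob_b_beats_a > prob_threshold and expected_loss_b < loss_cap


# Hypothetical inputs on a 3% baseline:
print(should_ship(0.97, 0.00002, baseline_rate=0.03))  # both checks pass
print(should_ship(0.97, 0.00005, baseline_rate=0.03))  # expected loss too high
print(should_ship(0.93, 0.00001, baseline_rate=0.03))  # probability too low
```

Keeping the two thresholds as explicit parameters makes the rule easy to document alongside the prior.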
How Bayesian and frequentist testing behave on the same e-commerce experiment
| Behavior | Bayesian | Frequentist (fixed-horizon) |
|---|---|---|
| Output | P(B > A), expected loss, credible interval | p-value, confidence interval |
| Peeking allowed? | Yes — posterior updates are self-contained | No — inflates Type I error past nominal α |
| Stopping rule | P(B > A) > 95% and expected loss < threshold | Reach pre-registered sample size |
| Typical time to decision (3% baseline, +10% MDE) | 10-14 days | 16-21 days |
| Requires a prior? | Yes — weak/uninformative is fine | No |
| Handles small samples | Gracefully (prior regularises) | Poorly (relies on asymptotic approximations) |
| Stakeholder interpretation | Direct: 'B is 96% likely to win' | Indirect: 'p=0.03 under H0' |
Most modern experimentation tools — including Metricuno — default to Bayesian reporting because it matches how operators actually make decisions. You still need guardrails: don't ship on day one with 30 conversions per arm just because P(B > A) clears 95%, and document your prior so results stay reproducible.
Bayesian testing FAQ
Frequentist testing asks how surprising your data would be if there were no real effect, expressed as a p-value with a fixed sample size. Bayesian testing combines a prior with the data to give a direct probability that one variant beats another, and can be checked at any time. The frameworks usually agree at large sample sizes; they diverge most when samples are small or you peek often.
Peeking is safe in the sense that each posterior is a valid probability statement on its own, so there's no p-value inflation. But peeking still tempts you to stop early on noise. Best practice is to combine P(B > A) with an expected-loss threshold and a minimum sample size so a single lucky day doesn't end the test prematurely.
For checkout or product-page conversion, a weak Beta(1, 1) or Beta(2, 50) prior works fine — it lets the data dominate quickly. If you have strong historical data (say, 2 years of GA4 import showing a stable 2.8% baseline), an informative prior centered on that rate tightens credible intervals and shortens tests, but you should pre-register the choice.
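To see how much the prior matters, the sketch below compares posterior means under the two weak priors mentioned above and under one informative prior. The informative prior Beta(28, 972) is an assumption made for illustration: it is centered on 2.8% and carries the weight of about 1,000 sessions. The observed counts (a 4% rate at two sample sizes) are also hypothetical.

```python
def posterior_mean(prior_alpha: float, prior_beta: float,
                   conversions: int, sessions: int) -> float:
    """Mean of the Beta posterior after a conjugate update."""
    alpha = prior_alpha + conversions
    beta = prior_beta + (sessions - conversions)
    return alpha / (alpha + beta)


priors = {
    "Beta(1, 1)": (1, 1),
    "Beta(2, 50)": (2, 50),
    "Beta(28, 972)": (28, 972),  # assumed informative prior, mean 2.8%
}

# Hypothetical data: the same 4% observed rate at a small and a large sample.
samples = {"small": (8, 200), "large": (800, 20_000)}

means = {
    (label, name): posterior_mean(a, b, conversions, sessions)
    for label, (conversions, sessions) in samples.items()
    for name, (a, b) in priors.items()
}
for (label, name), value in means.items():
    print(f"{label:5s} {name:14s} posterior mean = {value:.4f}")
```

At the small sample the informative prior pulls the estimate toward 2.8%, while at the large sample all three priors give nearly the same answer. That is the sense in which weak priors let the data dominate quickly.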
Run for at least one full business cycle (usually 7 or 14 days) to cover weekday/weekend mix and traffic-source variation. Then ship once P(B > A) > 95% and expected loss is below your tolerance — typically 0.05-0.1% of baseline CVR. For a store on a 3% baseline, that's usually 8,000-15,000 sessions per arm.
Bayesian tests tend to reach a decision slightly faster on average, and the gap widens when you'd otherwise need to pre-register a conservative sample size. The bigger gain is flexibility: you stop when the evidence is sufficient, not when an arbitrary horizon is reached. Expect 20-30% faster decisions on typical e-commerce experiments.
Expected loss is the average conversion-rate sacrifice you'd incur if you picked the wrong variant, integrated over the posterior. P(B > A) of 96% sounds great until you realise the 4% downside scenario costs you 8% of revenue. Expected loss collapses both numbers into a single 'how bad is being wrong' figure.
Bayesian testing extends to more than two variants: you compute 'probability to be best' across all arms simultaneously, and expected loss for each. Multi-armed bandits use the same machinery to dynamically allocate traffic toward the leading variant. For standard A/B/n tests, just make sure each arm gets enough traffic before reading results.
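A minimal sketch of 'probability to be best' for several arms: draw one conversion rate per arm from its posterior, record which arm is highest, and repeat. The arm names and counts below are hypothetical (5,000 sessions per arm with a Beta(1, 1) prior already applied), and the helper name is illustrative. Per-arm expected loss is omitted to keep the example short.

```python
import random


def prob_best(posteriors, draws=50_000, seed=7):
    """Estimate, for each arm, the probability that it has the highest rate.

    posteriors maps an arm name to the (alpha, beta) of its Beta posterior.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in posteriors}
    for _ in range(draws):
        sample = {name: rng.betavariate(a, b)
                  for name, (a, b) in posteriors.items()}
        wins[max(sample, key=sample.get)] += 1  # arm with the highest draw
    return {name: count / draws for name, count in wins.items()}


# Hypothetical posteriors: 150, 180 and 165 conversions out of 5,000 sessions.
arms = {
    "control": (151, 4851),
    "B": (181, 4821),
    "C": (166, 4836),
}
best = prob_best(arms)
for name, p in best.items():
    print(f"P({name} is best) = {p:.3f}")
```

The probabilities sum to one across arms, so adding a variant necessarily dilutes every other arm's share.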
When presenting results to stakeholders, phrase P(B > A) as a betting-odds statement: 'There's a 96% chance the new checkout converts better than the current one, and if we're wrong, we lose about 0.03% of conversions.' Avoid mixing it with p-value language; calling it a 'confidence level' will create exactly the misunderstanding you're trying to avoid.
VWO uses Bayesian statistics by default (its SmartStats engine), Optimizely uses sequential frequentist testing (Stats Engine), and Google Optimize used Bayesian methods before it was sunset in 2023. Metricuno reports Bayesian P(B > A) and expected loss alongside frequentist p-values, so you can use whichever framework your team is comfortable with.
If your organisation has regulatory or scientific-publication requirements that mandate p-values (rare in DTC, common in pharma or academia), stick with frequentist methods. Also avoid Bayesian methods if you can't commit to documenting and defending your prior: undisclosed priors are the easiest way for results to look more conclusive than they are.