Statistical methodology, demystified
Three engines behind smarter experimentation. Pick the one that matches how your team decides to ship.
Frequentist
The classical approach. You set a significance level (α) and statistical power (1 − β) before the test, run until the required sample size is reached, then check whether p < α. If yes, the result is statistically significant. Gives a clear yes/no decision with known error rate guarantees.
Sequential
Monitor continuously and stop early when sufficient evidence accumulates, while still controlling false positive rates. Uses always-valid p-values or the Sequential Probability Ratio Test (SPRT). You trade a larger sample size (~1.5–2×) for the freedom to stop a winning variant before the planned end date.
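To make the stopping rule concrete, here is a minimal sketch of Wald's SPRT for a single stream of conversions. It is a simplification: it tests one conversion rate (`p0` against `p1`) rather than comparing two variants, and the function name, parameters, and default error rates are illustrative assumptions, not the calculator's implementation.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: rate = p0 against H1: rate = p1 (hypothetical helper).

    observations: iterable of 0/1 outcomes, checked one at a time.
    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: stop and accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: stop and accept H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for converted in observations:
        n += 1
        if converted:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    return "continue", n                   # not enough evidence yet
```

The two boundaries come from the chosen α and β, which is how the test keeps its error rates under control even though the data is checked after every visitor.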
Bayesian
Instead of a p-value, you get an intuitive probability: “There is an 87% chance variant B beats A.” Incorporates prior beliefs (or uses a flat/uninformative prior). No fixed sample size required: stop when the probability threshold you care about is reached. Results are directly interpretable.
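A "chance B beats A" figure like this can be estimated by sampling from each variant's posterior. The sketch below assumes a Beta prior on each conversion rate (flat Beta(1, 1) by default) and uses Monte Carlo draws; the function name, the draw count, and the fixed seed are illustrative choices, not the calculator's method.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0,
                   draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta posteriors."""
    rng = random.Random(seed)  # seeded so repeated calls give the same estimate
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(prior + successes, prior + failures)
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws
```

For example, `prob_b_beats_a(500, 10_000, 560, 10_000)` estimates the probability that B's true rate exceeds A's given 5.0% and 5.6% observed conversion rates. The estimate carries some sampling noise, so more draws give a steadier number.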
A/B testing answers on the same page as the math
Use these notes to understand sample size, MDE, p-values, RPV, APC, and products per visitor before starting, or when reading a result.
Sample size & MDE
Sample size depends on four inputs: your baseline rate, the minimum effect you care about (MDE), confidence level (1 − α), and power (1 − β). Halving the MDE roughly quadruples the required sample. Always set these before running: peeking and stopping early inflate the false positive rate in standard frequentist tests.
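The four inputs combine in the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test, an equal traffic split, and an MDE expressed as a relative lift over the baseline; the function name and defaults are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-sided, equal split)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)           # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)     # sum of the two Bernoulli variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

n_10 = sample_size_per_variant(0.05, 0.10)   # 5% baseline, 10% relative MDE
n_05 = sample_size_per_variant(0.05, 0.05)   # same baseline, MDE halved
ratio = n_05 / n_10                          # close to 4, as the rule of thumb says
```

Because the squared difference `(p2 - p1) ** 2` sits in the denominator, halving the MDE multiplies the required sample by about four.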
P-values & confidence
A p-value is the probability of observing data at least as extreme as yours if the null hypothesis is true. p < 0.05 does not mean a 95% chance your variant is better; it means data this extreme would occur less than 5% of the time if there were no real difference between the variants. A 95% confidence interval comes from a procedure that captures the true effect in 95% of repeated experiments; it does not give a 95% probability for this single test.
Power & error control
Statistical power (1 − β) is the probability of detecting a real effect when one exists. At 80% power you’ll miss a true effect 20% of the time (Type II / false negative). α controls Type I error (false positives). Higher power and lower α both require larger samples. Industry default: α = 0.05, power = 80%.
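To see how power responds to sample size, the sketch below approximates the power of a two-sided two-proportion test for a given number of visitors per variant. It uses the normal approximation and ignores the negligible chance of rejecting in the wrong direction; the function name and inputs are illustrative.

```python
import math
from statistics import NormalDist

def approx_power(p1, p2, n_per_variant, alpha=0.05):
    """Approximate power to detect a true difference between rates p1 and p2."""
    se = math.sqrt(p1 * (1 - p1) / n_per_variant + p2 * (1 - p2) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_effect = abs(p2 - p1) / se                    # true effect in standard errors
    return NormalDist().cdf(z_effect - z_alpha)

power_small = approx_power(0.05, 0.055, 10_000)   # underpowered test
power_large = approx_power(0.05, 0.055, 31_231)   # close to the 80% default
```

Lowering `alpha` raises the critical value and therefore lowers power at a fixed sample size, which is why stricter thresholds need more traffic.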
Sample Ratio Mismatch
SRM occurs when the observed traffic split differs from the intended allocation, e.g. planning 50/50 but seeing 48/52. SRM invalidates test results even if p-values appear significant. Always check with a chi-square goodness-of-fit test. Common causes: bot traffic, browser redirects, sticky cookies, CDN caching, or experiment SDK misconfiguration.
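For a two-variant test, the chi-square goodness-of-fit check has one degree of freedom, so the p-value can be computed with the standard library alone. In the sketch below, the function name and the 0.001 alert threshold are assumptions; pick a threshold that suits your own tolerance for false alarms.

```python
import math

def srm_check(visitors_a, visitors_b, expected_share_a=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test for a two-variant traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold

# The 48/52 split from the text, on 10,000 visitors
chi2, p_value, has_srm = srm_check(4_800, 5_200)
```

A 48/52 split looks small, but on 10,000 visitors it is far outside what random assignment would produce, so the check flags it.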
Conversion rate
CR = Conversions ÷ Visitors. Use for discrete actions: purchases, sign-ups, clicks. Relative lift = (CR_B − CR_A) / CR_A. Bayesian analysis treats conversions as draws from a Beta-Bernoulli model. Frequentist: use a two-proportion z-test for large samples. For fewer than 30 conversions per variant, consider Fisher’s exact test instead.
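The frequentist path can be sketched as a two-proportion z-test. This version assumes a two-sided test, uses the pooled standard error for the test statistic and the unpooled one for the confidence interval on the absolute difference; the function name and return keys are illustrative.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Two-sided z-test for the difference between two conversion rates."""
    cr_a = conv_a / n_a
    cr_b = conv_b / n_b
    diff = cr_b - cr_a
    # Pooled rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the interval on the absolute difference
    se_unpooled = math.sqrt(cr_a * (1 - cr_a) / n_a + cr_b * (1 - cr_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "relative_lift": diff / cr_a,
        "z": z,
        "p_value": p_value,
        "ci": (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled),
    }

result = two_proportion_z_test(500, 10_000, 560, 10_000)
```

Here a 12% relative lift on 10,000 visitors per variant still falls short of significance at α = 0.05, and the confidence interval for the absolute difference includes zero.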
RPV, APC & products
Revenue Per Visitor (RPV) = Total Revenue ÷ Visitors. Average Products per Conversion (APC) = Total Units ÷ Conversions. These continuous metrics use Welch’s t-test rather than a z-test and require a standard deviation estimate to size the test. Revenue data is typically right-skewed, so consider a log-transform or non-parametric test if outliers are severe.
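For continuous metrics such as RPV, Welch's statistic can be computed from per-variant summary statistics. The sketch below returns the t statistic and the Welch-Satterthwaite degrees of freedom; the function name and inputs are illustrative, and converting to a p-value needs a t-distribution tail function (for example `scipy.stats.t.sf`), which is left out to keep the example dependency-free.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    var_a = sd_a ** 2 / n_a          # squared standard error of mean A
    var_b = sd_b ** 2 / n_b          # squared standard error of mean B
    t = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation; does not assume equal variances
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical RPV data: mean 2.00 vs 2.50, standard deviation 10 in both arms
t_stat, df = welch_t(2.00, 10.0, 1_000, 2.50, 10.0, 1_000)
```

The large standard deviation relative to the mean is typical of revenue data and is the reason these tests often need far more traffic than conversion-rate tests.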
