Why 90% of Your A/B Tests Are Lying to You (And the Secret Top Tech Companies Use)

Stop Obsessing Over Bayesian vs Frequentist – Your A/B Testing Culture Is the Real Problem
We see e-commerce teams spending weeks debating statistical methodologies while their competitors are running ten times more experiments and shipping decisions faster. They’re arguing about which testing religion to follow when the real question is: are you actually building a culture of experimentation?
Here’s the truth nobody in the A/B testing space wants to say out loud. The method matters less than you think. The mindset matters most.
Let’s break this down.
The Stats Debate Nobody Is Actually Winning
The experimentation world is caught in a binary war. Bayesian versus Frequentist. SmartStats versus p-values. Probability-to-win versus confidence intervals. And every SaaS vendor is telling you their approach is the one that will save your conversion rate.
The reality? Modern platforms are abandoning the binary entirely. Statsig, Eppo, GrowthBook, LaunchDarkly, Kameleoon – these platforms let you toggle between Bayesian and Frequentist depending on the context. Because the sharpest data teams in the world – the ones at Netflix, Booking.com, Microsoft, and Airbnb – figured this out years ago: the statistical method is a tool, not a religion.
Here is the actual fault line that matters in 2025 and beyond.
Warehouse-native, toggle-everything platforms – think Eppo, Statsig, GrowthBook, LaunchDarkly – are winning data-mature teams because they compose with your infrastructure, offer full methodological transparency, and flex to the experiment in front of you.
Opinionated, single-default platforms – think classic VWO (Bayesian all the way) or legacy Optimizely (sequential-first) – still dominate marketing-led CRO teams because their single default is easier to explain to a stakeholder who never wants to think about statistics in the first place.
Neither camp is wrong. They serve different humans. Know which human you are.
The Three Frameworks Every CRO Leader Needs to Understand
Before you can pick a platform or make a case to your team, you need to know what you are actually choosing between.
Frequentist testing is the OG. It asks: if there were genuinely no difference between control and variant, how likely is it that I would see data at least this extreme? That is the p-value – not the probability that the variant works, but the probability of seeing data this extreme if it does not. It demands a fixed sample size decided before you start, and it breaks violently if you peek at results early without corrections.
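To make the p-value definition concrete, here is a minimal sketch of a two-sided, two-proportion z-test using only the Python standard library. The function name and the visitor and conversion counts are illustrative assumptions, not figures from any real test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both arms share one conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under H0 the best estimate of the shared rate pools both arms.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a |z| at least this large if there is no true difference.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 2.0% vs 2.5% conversion on 10,000 visitors per arm.
p = two_proportion_p_value(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"p-value: {p:.4f}")
```

The result is only valid if the sample size was fixed in advance and the test is read once, which is exactly the constraint described above.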
Bayesian testing flips the question. It says: given the data I have collected, what is the probability variant B actually beats variant A? You get intuitive outputs. “There is an 87% chance the new checkout flow outperforms the old one.” Stakeholders understand that. They do not understand what a p-value of 0.04 means at 11 pm in a board meeting.
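A “probability B beats A” number can be sketched with a Beta-Binomial model and Monte Carlo draws. This is a minimal illustration assuming a uniform Beta(1, 1) prior and made-up counts; it is not the model any particular vendor ships.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Estimate P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 2.0% vs 2.5% conversion on 10,000 visitors per arm.
chance = prob_b_beats_a(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"Chance B beats A: {chance:.1%}")
```

Swapping in an informative prior changes only the two `1 +` terms, which is where the prior-sensitivity debate lives.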
Sequential testing – the approach Optimizely built their entire Stats Engine around, working with Stanford researchers – is the modern solution to the most practical problem in online experimentation. It lets you look at your results continuously without inflating false positives. You can stop early when something is clearly winning or clearly losing. That is not cheating. That is engineering.
Most mature teams end up with a hybrid: frequentist as their backbone for high-stakes decisions, Bayesian for exploratory and low-traffic work, and sequential testing as the continuous monitoring layer.
The Real Competitive Advantage Nobody Is Talking About: Variance Reduction
Here is the insight that will separate your programme from the pack in the next 24 months.
Not the Bayesian-versus-Frequentist debate. Not bandits. Not AI-powered test generation.
CUPED. And its more sophisticated cousins.
CUPED – Controlled-experiment Using Pre-Experiment Data – is a technique Microsoft researchers developed in 2013 that uses a visitor’s historical behaviour before the test began to strip out background noise from your metric. When you reduce that noise, you need fewer visitors to reach a reliable conclusion. That means your experiments run faster. At Microsoft, CUPED collapsed 8-week tests to 5-6 weeks. At Booking.com, it became the primary tool for detecting effects as small as 1% on a 2% conversion rate – which would otherwise require more than 15 million users.
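The core CUPED adjustment is short enough to sketch. The example below uses simulated spend data where pre-experiment behaviour predicts in-experiment behaviour; the function name, the data, and the 0.8 correlation structure are all assumptions for illustration. In a real test, theta is estimated on the pooled data from both arms.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return y - theta * (x - mean(x)), with theta = cov(y, x) / var(x)."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    # Subtracting the centred covariate keeps the mean and shrinks the variance.
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


rng = random.Random(42)
pre = [rng.gauss(100, 20) for _ in range(5_000)]     # simulated pre-experiment spend
post = [0.8 * x + rng.gauss(0, 10) for x in pre]     # simulated in-experiment spend
adjusted = cuped_adjust(post, pre)

reduction = 1 - variance(adjusted) / variance(post)
print(f"Variance removed by CUPED: {reduction:.0%}")
```

The stronger the correlation between the pre-period covariate and the metric, the more variance is removed. For brand-new visitors with no history, the adjustment does nothing.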
Eppo’s enhanced version, CUPED++, claims up to 65% faster experiment conclusions. Research from KDD 2025 showed CUPAC – an ML-powered extension of the same idea developed at DoorDash – can reduce confidence intervals by more than 35%.
When you are evaluating A/B testing platforms, this is the question that should top your list: does this platform do CUPED, and is it switched on by default?
Platforms with first-class variance reduction include Statsig, Eppo, GrowthBook, LaunchDarkly, Kameleoon, and Spotify Confidence. Platforms where it is more limited or not natively prominent include Adobe Target and the legacy Optimizely stack.
That gap in your testing velocity is not a statistical preference. That is a competitive advantage.
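The 15-million-user figure quoted above can be sanity-checked with the standard two-proportion sample-size formula. The sketch below assumes a two-sided alpha of 0.05 and 80% power, neither of which is stated in the original figure, and treats variance reduction as a simple multiplier on metric variance. The function and parameter names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def users_needed(baseline, relative_lift, alpha=0.05, power=0.80,
                 variance_reduction=0.0):
    """Total users across two equal arms for a two-sided test of proportions."""
    normal = NormalDist()
    z_total = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    delta = baseline * relative_lift                  # absolute effect to detect
    metric_var = baseline * (1 - baseline) * (1 - variance_reduction)
    per_arm = 2 * z_total**2 * metric_var / delta**2
    return 2 * ceil(per_arm)


plain = users_needed(baseline=0.02, relative_lift=0.01)
reduced = users_needed(baseline=0.02, relative_lift=0.01, variance_reduction=0.30)
print(f"No variance reduction:  {plain:,} users")
print(f"30% variance reduction: {reduced:,} users")
```

Because required sample size scales linearly with variance, every percentage point of variance removed is a percentage point off the traffic you need.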
The Peeking Problem Is Destroying Your Data Quality (And You Probably Do Not Know It)
I want to talk about what researchers at Optimizely found when they actually audited how their customers were using fixed-horizon tests.
People were peeking at the results every single day. And when results looked significant, they were stopping the test and declaring a winner.
That behaviour – totally understandable, totally human, totally wrong – inflates your false positive rate from a nominal 5% to more than 30%. What that means in practice: when a change has no real effect at all, you will still crown it a winner nearly a third of the time. And because most ideas do not move the metric, a large share of your “winning” tests are statistical accidents. You shipped something based on noise, called it a conversion win, and moved on with misplaced confidence.
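The inflation is easy to reproduce in simulation. The sketch below runs A/A tests (no true effect) on an idealised metric whose daily difference is standard normal, then compares a team that stops at the first significant daily peek with a team that only reads the final result. The 28-day horizon and all names are assumptions; the exact inflated rate depends on how many peeks you take.

```python
import random
from math import sqrt


def false_positive_rates(days=28, sims=4_000, z_crit=1.96, seed=1):
    """Return (rate with daily peeking, rate with one final look) under H0."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(sims):
        total, ever_significant = 0.0, False
        for day in range(1, days + 1):
            total += rng.gauss(0, 1)       # daily difference, true effect is zero
            z = total / sqrt(day)          # z-statistic on all data so far
            if abs(z) > z_crit:
                ever_significant = True    # a daily peeker would have stopped here
        peeking_hits += ever_significant
        fixed_hits += abs(z) > z_crit      # only the final look counts
    return peeking_hits / sims, fixed_hits / sims


peeking, fixed = false_positive_rates()
print(f"Daily peeking:     {peeking:.1%} false positives")
print(f"Single final look: {fixed:.1%} false positives")
```

The single-look rate stays near the nominal 5%, while the peeking rate climbs with every additional look.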
Johari, Koomen, Pekelis and Walsh published the formal analysis of this in 2017. Optimizely built their Stats Engine to fix it. The solution – sequential testing with always-valid inference – lets you look at results anytime without corrupting your conclusions.
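For intuition, here is a minimal sketch of an always-valid p-value built from a mixture sequential probability ratio test (mSPRT), in the spirit of the always-valid inference literature. It assumes a normally distributed daily difference with known variance `sigma2` and a normal mixing prior with variance `tau2`; both values and all names are illustrative, and this is not a reproduction of any vendor’s engine.

```python
import random
from math import exp, log


def always_valid_p_values(daily_diffs, sigma2=1.0, tau2=1.0):
    """Running always-valid p-value after each observation (H0: mean is 0)."""
    p, total, history = 1.0, 0.0, []
    for n, diff in enumerate(daily_diffs, start=1):
        total += diff
        denom = sigma2 + n * tau2
        # Log of the mixture likelihood ratio against H0, for a N(0, tau2) mixture.
        log_lr = 0.5 * log(sigma2 / denom) + tau2 * total**2 / (2 * sigma2 * denom)
        p = min(p, exp(-log_lr))           # p-value can only go down over time
        history.append(p)
    return history


# A/A simulation: peek every day for 28 days and stop whenever p < 0.05.
rng = random.Random(7)
sims = 2_000
false_alarms = sum(
    always_valid_p_values([rng.gauss(0, 1) for _ in range(28)])[-1] < 0.05
    for _ in range(sims)
)
print(f"False positive rate with daily peeking: {false_alarms / sims:.3f}")
```

The price of this guarantee is some loss of power compared with a single fixed-horizon look, which is why the choice of `tau2` matters in practice.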
If your platform does not solve the peeking problem with proper sequential corrections, and your team peeks, you are operating with a false positive rate of 30% or more. Think about what that does to your roadmap.
Addressing the Objections You Are Already Forming
“Our traffic is too low for rigorous testing.”
This is the most common objection in e-commerce CRO, and it is also the most misused. Low traffic does not mean you cannot test – it means you need the right approach for low traffic. Bayesian testing with non-informative priors is more forgiving at small sample sizes than fixed-horizon frequentist tests. Variance reduction through CUPED amplifies the power you do have. And focusing your tests on large, meaningful changes – above-the-fold layouts, primary CTAs, pricing structures – rather than micro-optimisations means you are more likely to detect real signals even with modest visitor volumes.
“Bayesian results are biased because of the prior.”
Yes, poorly chosen priors introduce bias. Non-informative priors largely eliminate this concern while keeping the interpretive advantages. And here is the counterpoint: frequentist confidence intervals are also frequently misread as posterior probabilities by non-statisticians. Both frameworks carry interpretive risk. The question is which risks you manage better in your specific team context.
“We cannot get statistical significance fast enough to make decisions.”
Then you are probably measuring the wrong metrics or building tests around effects too small to detect in your traffic window. Netflix shifted from asking “is this statistically significant?” to evaluating decision rules based on their cumulative returns to North Star business metrics across 123 historical A/B tests. One new decision rule they identified this way was estimated to increase cumulative returns by 33% – not because the statistics were smarter, but because the question was better.
“Bandits are the future. Why test at all?”
Multi-armed bandits are excellent tools for specific, short-horizon problems where maximising revenue during the learning period matters more than getting a clean causal estimate. Headline rotation on a homepage during a 48-hour sale. Image selection for a paid campaign. That is the sweet spot. For checkout redesigns, pricing structure changes, or any test where you need an unbiased measurement of lift, bandits are the wrong tool. They give you traffic optimisation, not causal clarity.
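The trade-off shows up clearly in a small Thompson sampling sketch. The conversion rates below are invented and the function name is hypothetical. Notice that the weaker arm ends up with very little traffic, which is good for revenue during the run and bad for measuring lift precisely.

```python
import random


def thompson_sampling(true_rates, rounds=5_000, seed=0):
    """Simulate a Beta-Bernoulli bandit; return how often each arm was shown."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible rate per arm from its posterior, show the best draw.
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]


pulls = thompson_sampling(true_rates=[0.05, 0.10])
print(f"Impressions per headline: {pulls}")
```

With so few impressions on the losing arm, any estimate of the difference between the two has a wide interval, which is the causal-clarity cost described above.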
What Booking.com, Netflix, and Microsoft Actually Do (And What You Can Steal From Them)
These three case studies tell you everything you need to know about experimentation at scale.
Booking.com runs more than 1,000 concurrent A/B tests. Only roughly 10% produce meaningful wins. Their former Director of Experimentation, Lukas Vermeer, put it plainly: the secret was not the number of tests. It was that experimentation became the default mode of decision-making. Every feature – no matter how obvious the outcome seemed – gets tested. The majority of experiments produce null results. That is not failure. That is information.
Netflix went further. They stopped treating statistical significance as the final arbiter of shipping decisions. Instead, they built a framework that evaluates decision rules – the policies that map experimental results to ship or no-ship decisions – based on cumulative returns to business North Star metrics across their entire historical library of tests. They also invest heavily in detecting heterogeneous treatment effects: identifying which specific user segments respond to a treatment differently. Because “the new variant wins on average” and “the new variant wins for your highest-value customer segment” are two very different statements.
Microsoft ExP runs approximately 100,000 A/B tests per year. They invented CUPED. They built SRM (Sample Ratio Mismatch) detection as a first-class diagnostic that runs before any statistical analysis. And the most important lesson from nearly 20 years of their data: about one-third of ideas are positive and statistically significant. One-third are flat. The remaining third cause harm. Every team that skips testing because they are confident in their intuition will, sooner or later, ship from that final third and never find out.
The common thread across all three? Tools are easy. Culture is hard. Getting engineers, product managers, designers, and marketers to default to experimentation over opinion is the most valuable competitive advantage in e-commerce conversion – and no statistical engine gives it to you.
The Decision Framework That Actually Applies to Your Business
Here is how to match methodology to context.
If you run a small Shopify store with fewer than 100,000 monthly visits, go Bayesian with a non-informative prior. Use VWO, Convert.com, or GrowthBook. Do not over-interpret small differences. Run tests for at least two to four weeks. Focus on changes with large expected effects – primary CTA, page layout, trust signals near checkout.
If you are a medium-sized retailer with between 100,000 and 10 million monthly visits, this is the sweet spot for most platforms. Use sequential testing with CUPED enabled. Statsig, Kameleoon, or LaunchDarkly are all solid choices. Build a metric hierarchy – one primary metric per experiment, two or three guardrails, a wider exploratory set with false-discovery-rate correction.
If you are an enterprise e-commerce business, you need a warehouse-native platform with full sequential validity, CUPED, SRM diagnostics, and documented methodology traceable to peer-reviewed research. Eppo or Statsig. Or build in-house. At this scale, detecting a 1% relative change in a 2% conversion rate requires more than 15 million users – and variance reduction is not a nice-to-have, it is the difference between a four-week test and an eight-week test.
If you are a two-sided marketplace, stop asking “Bayesian or Frequentist?” and start asking “what experimental design controls for network interference?” Standard A/B testing breaks in marketplace settings because treatments spill over between buyers and sellers. Switchback experiments – randomising in time rather than across users – are what DoorDash, Uber, and Airbnb use. Statsig supports switchback natively.
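Randomising in time rather than across users can be sketched as a deterministic assignment per region and time window. Everything here is an assumption for illustration: the 30-minute window, the salt, the region name, and the function itself. It does not show how any named company or platform implements switchbacks.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def switchback_arm(region, timestamp, window_minutes=30, salt="exp-001"):
    """Assign a whole region to one arm for each fixed time window."""
    window = int(timestamp.timestamp() // (window_minutes * 60))
    key = f"{salt}:{region}:{window}".encode()
    # Hashing gives a stable pseudo-random coin flip per (region, window).
    digest = hashlib.sha256(key).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
for step in range(4):
    moment = start + timedelta(minutes=30 * step)
    print(moment.isoformat(), switchback_arm("london", moment))
```

Every buyer and seller in a region sees the same experience within a window, so spillover between the two sides stays inside a single arm. The analysis unit then becomes the window, not the user.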
The Bigger Picture You Cannot Afford to Miss
The field is shifting. The clearest trends from the most current research:
Sequential testing – the ability to look at results continuously with valid inference – is no longer an advanced feature. It is the baseline. Every serious platform either defaults to it or is racing to add it.
Variance reduction through CUPED-class methods is becoming universal among the platforms that matter. Within three years it will be table stakes everywhere.
The Bayesian-versus-Frequentist framing will matter less and less as platforms offer both side-by-side and let you choose per test. The real competitive differentiation is shifting to: quality of variance reduction, warehouse-native transparency, and how tightly experimentation integrates with your analytics and personalisation stack.
AI-assisted experiment analysis – tools that surface segment patterns, explain why winners won, and generate follow-up hypotheses – is the fastest-moving frontier right now. Statsig and Eppo have already shipped LLM-based analytical assistants. Adobe’s Experimentation Accelerator represents a move from passive testing to proactive, AI-guided experimentation.
And LLM-powered simulation tools like SimGym are showing early promise in reducing test cycles from weeks to under an hour for UI changes – though they are not yet a substitute for randomised online testing.
The Action You Need to Take Today
Most CRO teams are running fewer tests than they should, on the wrong metrics, with platforms that do not protect them from peeking, and without the variance reduction tools that could make their tests up to twice as fast.
Before you touch your next test, answer these four questions:
One. Does your platform support sequential testing or always-valid inference? If not, and your team checks results daily, your false positive rate can climb past 30%.
Two. Is CUPED or a comparable variance reduction technique enabled on your key conversion metrics? If not, your experiments are probably running weeks longer than they need to.
Three. Do you run SRM checks before reading any results? A Sample Ratio Mismatch – where your actual traffic split does not match your configured split – invalidates everything downstream.
Four. Does your organisation treat a null result as useful information? Or does a flat experiment get treated as a waste of time? The answer to that question is the biggest indicator of whether your experimentation programme will compound in value over the next three years or stay permanently stuck.
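The SRM check in question three is a chi-square goodness-of-fit test on the traffic counts. Here is a minimal standard-library sketch for a two-arm test. The visitor counts are invented, and the 0.001 alert threshold is a commonly used convention that I am assuming here, not something specified above.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square (1 degree of freedom) p-value for the observed traffic split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for a test configured as a 50/50 split.
p_srm = srm_p_value(n_control=50_000, n_treatment=51_500)
if p_srm < 0.001:  # assumed alert threshold
    print(f"SRM detected (p = {p_srm:.2g}): fix assignment before reading results")
else:
    print(f"No SRM detected (p = {p_srm:.2g})")
```

A 1.5-point imbalance looks harmless on a dashboard, yet at this volume it is far too large to be chance.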
The conversion rate belongs to whoever builds the better learning machine.
Build yours.
Sources and further reading:
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
- Wald, A. (1945). Sequential Analysis.
- Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). “Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data.” Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM).
- Deng, A., et al. (2023). “Variance Reduction Using In-Experiment Data: Efficient and Targeted Online Measurement for Sparse and Delayed Outcomes.”
- Li, J., Tang, J., & Bauman, J. (2020). “Control Using Predictions as Covariates (CUPAC).” DoorDash Engineering.
- Howard, S. R., Ramdas, A., McAuliffe, J. D., & Sekhon, J. S. (2021). “Time-uniform, nonparametric, nonasymptotic confidence sequences.” Annals of Statistics.
- Johari, R., Pekelis, L., & Walsh, D. J. (2015/2022). “Always Valid Inference: Bringing Sequential Analysis to A/B Testing.” arXiv:1512.04922.
- Johari, R., Koomen, P., Pekelis, L., & Walsh, D. J. (2017). “Peeking at A/B Tests.” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Waudby-Smith, I., et al. (2023). “Asymptotic Confidence Sequences.” arXiv:2103.06476.
- Zhao, Z., et al. (2019). “mSPRT Research and Sequential Testing Methodology.”
- Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association (JASA).
- Bojinov, I., Simchi-Levi, D., & Zhao, J. (2023). “Design and Analysis of Switchback Experiments.” Management Science.
- Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). “Metalearners for estimating heterogeneous treatment effects using machine learning.” Proceedings of the National Academy of Sciences (PNAS).
- Wager, S., & Athey, S. (2018). “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.” Journal of the American Statistical Association (JASA).
- Fabijan, A., Gupchup, J., Gupta, S., et al. (2019). “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners.” KDD 2019.
- Wasserstein, R. L., & Lazar, N. A. (2016). “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician.
- Adobe Research (2025). “Experimentation Accelerator: A Unified Framework for AI-Guided Experimentation.”
- Diamantopoulos, et al. “Engineering for a Science-Centric Experimentation Platform.” (Netflix XP).
- Gupta, S., et al. “The Anatomy of a Large-Scale Online Experimentation Platform.” (Microsoft ExP).
- Lindon, M., Ham, D., Tingley, D., & Bojinov, I. (2024). “Anytime-valid linear models for sequential A/B testing.” (Netflix Research).
- Stucchio, C. “Bayesian A/B Testing at VWO: SmartStats Technical Whitepaper.”
- Airbnb Engineering. “Experiment Reporting Framework (ERF) and Marketplace Network Effects.”
- Booking.com (Lukas Vermeer). “Sequential Testing and SRM Diagnostics at Scale.”
- DoorDash Engineering. “Multi-Armed Bandit Platforms and CUPAC Implementation.”
- Li, A. (2025). “The Great Statistical Engine Wars: How A/B Testing Platforms Actually Make Decisions.”
- Spotify Engineering. “Choosing a Sequential Testing Framework: Group Sequential vs. Always-Valid.”