Most product managers run A/B tests.
Few run them correctly.
Look, the problem isn't that this stuff is complicated. People just skip the basics. They launch tests without calculating sample sizes. They check results every day like they're watching their 401k during a market crash. They celebrate wins that move the needle by 0.03%.
This guide teaches you the fundamentals. Not advanced techniques. Not Bayesian methods. Just the core stuff that separates tests you can trust from tests that waste three weeks of engineering time.
What A/B Testing Actually Does
A/B testing answers one question: Is Version B better than Version A?
Sounds simple.
It's not.
You can't just show B to 100 users and A to 100 users and call it done. Random chance creates differences everywhere. One user converted because it's payday. Another bounced because their train just arrived at their stop. You need way more data to separate actual improvements from random noise.
Think about flipping a coin. Get three heads in four flips and you might think the coin's weighted. Flip it 100 times and get 75 heads? Ok yeah, something's definitely wrong with that coin.
A/B testing works the same way. More data = less bullshit.
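You can put numbers on that coin intuition. A quick sketch with scipy:

```python
from scipy.stats import binomtest

# Chance a fair coin gives at least 3 heads in 4 flips: common, tells you nothing.
print(binomtest(3, n=4, p=0.5, alternative="greater").pvalue)    # ~0.31

# Chance a fair coin gives at least 75 heads in 100 flips: vanishingly small.
print(binomtest(75, n=100, p=0.5, alternative="greater").pvalue)  # ~3e-7
```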
The Four Concepts You Need to Understand
Statistical Significance
This tells you if the result's real or just noise.
When you test two versions, you'll always see a difference. Version B shows 5.2% conversion while A shows 5.0%. Great. But is that actually a real difference or just random variation from Tuesday being different than Wednesday?
Statistical significance gives you an answer. It measures how confident you can be that the difference reflects actual user behavior, not just dumb luck. Standard threshold: 95% confidence. This means that if the two versions were actually identical, a difference this big would show up less than 5% of the time by chance alone.
Here's what people get wrong: They think statistical significance means their change caused the improvement. Nope. It just means the difference probably isn't random. Your change might've launched the same week Apple announced their new phone. Or your competitor had a major outage. Or it was Black Friday. Statistical significance can't tell you WHY something happened.
P-Value
The p-value answers: "If the versions are actually identical, what's the probability we'd see this difference by dumb luck?"
Standard rule is p-value below 0.05.
What that actually means: A p-value of 0.03 tells you there's a 3% chance you'd see a difference at least this big if nothing really changed. That's low enough to trust. A p-value of 0.15? You'd see a gap like that 15% of the time even if the two versions were identical.
Too risky.
What p-values DON'T tell you: How much better B is than A. A tiny, meaningless improvement can have a low p-value if you've got enough users. The p-value just confirms the difference exists. Not whether it matters.
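Here's a sketch that makes both points concrete, using a two-proportion z-test from statsmodels on the 5.2% vs 5.0% example from earlier. The user counts are invented for illustration:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: 10,000 users per version, converting at 5.2% vs 5.0%.
stat, p = proportions_ztest(count=[520, 500], nobs=[10_000, 10_000])
print(f"p-value: {p:.2f}")    # ~0.52 -- that gap is very plausibly noise

# The same tiny 0.2-point gap with a million users per version becomes "significant",
# even though the improvement is no more meaningful.
stat, p = proportions_ztest(count=[52_000, 50_000], nobs=[1_000_000, 1_000_000])
print(f"p-value: {p:.1e}")    # around 1e-10 -- far below 0.05
```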
Confidence Intervals
This gives you a range for where the true result probably sits.
Say your test shows B improved conversions by 10%. The confidence interval might be 7% to 13%. The real improvement is somewhere in that range. Probably.
Narrow ranges mean you're more certain. Wide ranges mean you need more data.
Here's why this actually matters in practice: If your confidence interval for B is 8% to 12% and for A is 10% to 14%, those ranges overlap. You can't declare a winner yet. The true values might be identical.
How to use this: Before launching a test, decide your minimum meaningful improvement. If your confidence interval doesn't clear that threshold, keep testing or move on to something else.
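Here's a minimal sketch of that check, using invented counts and the plain normal-approximation interval (real tools use slightly fancier formulas, but the logic is the same):

```python
import math

# Hypothetical results: control converts 500/10,000, variant converts 550/10,000.
conv_a, n_a = 500, 10_000
conv_b, n_b = 550, 10_000

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a                                   # absolute lift in percentage points
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
low, high = diff - 1.96 * se, diff + 1.96 * se     # 95% confidence interval

print(f"Lift: {diff:.1%}, 95% CI: [{low:.2%}, {high:.2%}]")
# Prints roughly [-0.12%, 1.12%]. The interval includes 0%, so you can't rule out
# "no improvement" yet. Keep testing or accept the uncertainty.
```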
Statistical Power
Power measures your test's ability to detect real improvements when they exist.
Standard setting: 80% power. Translation: "If there's a real improvement at least as big as the one we're looking for, we'll catch it 8 out of 10 times."
Low power explains why so many tests end "inconclusive." The improvement was real, but your test couldn't detect it. You needed more users or a bigger change.
The tradeoff: Higher power needs more users. Most companies stick with 80% because it balances accuracy with not waiting six months to ship anything. Going from 80% to 90% power increases your sample size by roughly a third. Going to 95% power adds roughly two-thirds more users than the 80% baseline.
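Those multipliers come straight from the standard normal-approximation formula: required sample size scales with the square of the summed z-scores for significance and power. A quick sketch:

```python
from scipy.stats import norm

alpha = 0.05
z_alpha = norm.ppf(1 - alpha / 2)     # 1.96 for 95% significance
z_80 = norm.ppf(0.80)                 # baseline: 80% power

# Sample size scales with (z_alpha + z_power)^2, so compare each level to 80% power.
for power in (0.80, 0.90, 0.95):
    multiplier = (z_alpha + norm.ppf(power)) ** 2 / (z_alpha + z_80) ** 2
    print(f"{power:.0%} power: ~{multiplier:.2f}x the users of an 80% power test")
# Prints roughly 1.00x, 1.34x, 1.66x.
```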
Sample Size: The Most Important Calculation
You need to know your sample size BEFORE launching a test.
Not "let's run it for two weeks and see." Not "we'll check back when we hit 10,000 users." Calculate the exact number of users required per version. Do this first.
Why Sample Size Actually Matters
Small changes need huge sample sizes to detect.
Rough estimates at a 15% baseline conversion rate (relative improvements):
- 50% improvement: ~400 users per version
- 20% improvement: ~2,500 users per version
- 5% improvement: ~40,000 users per version
Notice the pattern? Halving the effect size roughly quadruples the required sample.
This creates a practical constraint. If you need 80,000 total users and get 500 visitors per day, your test runs 160 days. That's not realistic. You either need more traffic, a bigger change, or a different approach.
Sample Size Depends on Four Things
Current conversion rate: Your baseline performance. Higher baseline rates need fewer users to detect the same relative improvement.
Minimum detectable effect: The smallest improvement you care about. Want to detect 5% lifts? You'll need way more users than if you only care about 20% improvements.
Significance level: Usually 95% (alpha = 0.05). Going to 99% increases sample size but reduces false positives.
Statistical power: Usually 80% (beta = 0.20). Going to 90% increases sample size but reduces false negatives.
Don't guess these numbers. Use a calculator.
Worked Example
You're testing a new product page.
- Current conversion: 5%
- Minimum effect: 20% relative improvement (from 5% to 6%)
- Significance: 95%
- Power: 80%
Sample size needed: 7,664 visitors per variant. Total: 15,328 visitors.
Your page gets 1,000 daily visitors? You need 16 days minimum. But that assumes perfect 50/50 splits and no data quality issues. Add a buffer. Run it for three weeks.
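If you'd rather script this than click through a web calculator, here's a sketch of the same calculation using statsmodels. Different tools use slightly different approximations, so expect answers a few percent apart; this one lands near 8,100 per variant:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                     # current conversion rate
target = baseline * 1.20            # minimum effect: 20% relative lift (5% -> 6%)

effect = proportion_effectsize(target, baseline)    # Cohen's h for two proportions
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,                     # 95% significance
    power=0.80,                     # 80% power
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} visitors per variant, ~{2 * n_per_variant:,.0f} total")
```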
The Five Rules That Prevent Bad Tests
Rule 1: Calculate Sample Size First
No exceptions.
Before you write the test plan. Before you brief engineering. Before you tell your boss about this great idea.
Calculate how many users you need. Then check if you can actually run the test that long.
Why this matters: Most inconclusive tests happen because people didn't calculate sample size. They ran the test for a week because a week felt reasonable. The test needed three weeks. Results were meaningless.
Rule 2: Don't Check Results Early
This is the hardest rule to follow.
You'll want to peek. Your boss will ask for updates. Engineering will wonder if it's working.
Resist.
The problem with peeking: Each time you check results, you give randomness another chance to fool you. It's like flipping a coin until you get the result you want. Eventually, random variation will create a "winner."
Statistical significance assumes you check once, at the end. Check 10 times along the way and your 5% false positive rate climbs to roughly 20%.
What to do instead: Set your sample size. Wait until you hit that number. Then look at results. Just once. If you absolutely must check early (and you shouldn't), use sequential testing methods designed for interim looks. Don't just eyeball the dashboard.
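If you want to see the damage peeking does, here's a small simulation sketch. Both versions are identical, so every "significant" result is a false positive. The counts and checkpoints are arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_sims, n_per_arm, true_rate = 2_000, 10_000, 0.05
checkpoints = [2_000, 4_000, 6_000, 8_000, 10_000]   # interim "peeks"

def p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test with pooled variance."""
    p_a, p_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    return 1.0 if se == 0 else 2 * norm.sf(abs(p_a - p_b) / se)

fp_final, fp_peeking = 0, 0
for _ in range(n_sims):
    a = rng.random(n_per_arm) < true_rate   # identical versions: any "win" is noise
    b = rng.random(n_per_arm) < true_rate
    if p_value(a.sum(), b.sum(), n_per_arm) < 0.05:
        fp_final += 1
    if any(p_value(a[:k].sum(), b[:k].sum(), k) < 0.05 for k in checkpoints):
        fp_peeking += 1

print(f"Check once at the end: {fp_final / n_sims:.1%} false positives")
print(f"Peek five times:       {fp_peeking / n_sims:.1%} false positives")
```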
Rule 3: Ship Only Clear Winners
Your test shows statistical significance.
Great.
Now ask: does the improvement actually matter?
B improved conversion by 0.1%. Statistically significant with your huge sample size. But 0.1% won't move your business metrics. Don't ship it.
Decide this upfront: What's the minimum improvement worth the engineering effort, risk, and opportunity cost? Write it in your test plan. Stick to it.
Real example: You calculate that 0.5% conversion improvement adds $50K annual revenue. Engineering estimates two weeks to implement. Is that worth it? Depends on what else they could build in two weeks.
Rule 4: Test One Thing at a Time
Testing 10 variations simultaneously feels productive.
It's not.
The multiple comparison problem: Test 10 versions against your control and there's roughly a 40% chance at least one clears p < 0.05 purely by chance. The more variations you test, the more likely you'll see false positives.
What to do: Test 2-3 versions maximum. Or adjust your significance level using Bonferroni correction. Testing 5 versions? Use p < 0.01 instead of p < 0.05. But really, just test fewer things. If you have 10 ideas, run them sequentially. Test the most promising first.
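Here's a sketch of the Bonferroni adjustment using statsmodels; the p-values are invented for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from testing 5 variants against the control.
p_values = [0.004, 0.048, 0.20, 0.03, 0.75]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for p, p_adj, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant after correction? {significant}")
# Only the 0.004 result survives; 0.03 and 0.048 looked like winners
# but don't clear the adjusted bar of 0.05 / 5 = 0.01.
```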
Rule 5: Check Your Traffic Split
Your 50/50 test should send exactly 50% to A and 50% to B.
Check this on day one.
Off by 0.5%? Probably fine. Off by 2% with thousands of users on each side? Your randomization is broken. Stop the test and fix it.
Why this matters: Broken randomization means your groups aren't comparable. Maybe A gets more mobile traffic. Maybe B gets more returning users. Your results won't be valid.
How to check: Look at your analytics on day one. Calculate the actual split. If it's wrong, investigate before more users see the test.
Common Mistakes That Ruin Tests
Testing Without Enough Traffic
You want to test a 5% improvement. You get 200 visitors per day. Do the math: you need 40,000 users per variant.
That's 400 days.
The reality: You can't test small improvements without large traffic. Pick bigger changes. Or use a different method like user research or fake door tests.
Celebrating Noise
B shows 12% improvement. Statistical significance: p = 0.049. Ship it!
Wait.
Check the confidence interval. It's 0.1% to 24%. The true effect might be barely above zero. Or it might be huge. You don't actually know.
What to do: Look at confidence intervals, not just p-values. If the range includes zero or dips below your minimum meaningful effect, keep testing.
Ignoring Sample Ratio Mismatch
Day 10 of your test. A has 5,234 users. B has 4,891 users.
Close enough, right?
Wrong. That's roughly a 52/48 split, and with more than 10,000 users the odds of drifting that far by pure chance are well under 1 in 1,000. Something's broken.
Common causes: Bot traffic hitting one version more. Redirect issues. Page load failures. Whatever the cause, your results aren't trustworthy.
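You don't have to eyeball this. A chi-square test tells you how unlikely a split is under a true 50/50 allocation. A sketch using the counts above:

```python
from scipy.stats import chisquare

observed = [5_234, 4_891]                  # users who actually hit A and B
expected_split = [0.5, 0.5]                # the split you configured
total = sum(observed)
expected = [total * share for share in expected_split]

stat, p = chisquare(observed, f_exp=expected)
print(f"SRM check p-value: {p:.5f}")       # ~0.0007 here -- the split is almost certainly broken
if p < 0.001:                              # a conservative threshold often used for SRM alerts
    print("Sample ratio mismatch: stop the test and investigate.")
```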
Testing Everything
You have 20 ideas. You test all of them at once on different pages. Ship the winners.
The problem: Run 20 tests at 95% significance and the odds that at least one shows a false positive are close to two in three. You'll ship changes that don't work.
Better approach: Prioritize ruthlessly. Test your best ideas. Accept that you can't test everything.
Stopping Tests Early
Your test hits significance on day 3. You needed 14 days. You ship it anyway.
What happens next: The effect disappears. Your early win was random variation. Now you've built something that doesn't work.
The discipline: Decide your sample size. Run to completion. No exceptions for "good news."
Real-World Context
Booking.com's Reality Check
Booking.com runs 25,000 tests annually.
Only 10% improve their metrics.
This isn't failure. It's how testing works. Most ideas don't work. The companies that succeed test more ideas, fail faster, and ship the rare winners.
Your takeaway: Don't expect every test to win. Expect 1 in 10 to work. That's normal. That's good.
Microsoft's $80 Million Blue Link
Microsoft tested different shades of blue for ad links. The winning color (#0044CC) generated $80 million more revenue annually.
Two lessons here:
First, small changes can matter at scale. But you need massive sample sizes to detect them reliably. Microsoft had the traffic to test subtle color differences. You probably don't.
Second, what looks like a tiny change might have huge financial impact. Test systematically, even when changes feel minor.
Obama Campaign's Counter-Intuition
The Obama campaign tested 24 combinations of images and button text on their signup page.
The team's favorite: a professional photo with "Sign Up Now."
The winner: A casual family photo with "Learn More." It lifted signups by roughly 40%, which the team estimated was worth about $60 million in additional donations.
The lesson: Your intuition about what works is probably wrong. Test it. Button text matters. Images matter. Your assumptions don't matter.
Airbnb's Representative Sample Problem
Airbnb tested making "Instant Book" more prominent. The test showed clear improvements. They launched to everyone.
Metrics got worse.
What happened: The test users weren't representative of all users. Random sampling matters. So does sample composition.
How to avoid this: Make sure your test users match your overall population. Check that mobile/desktop splits, new/returning user ratios, and geographic distributions align. If your test users skew heavily toward power users, your results might not generalize.
When to Skip A/B Testing
A/B testing isn't always the right tool.
Don't A/B test:
Obvious bugs. If your checkout button doesn't work, fix it. Don't test whether working buttons perform better.
Legal or regulatory requirements. GDPR compliance isn't optional. Don't test "with consent" vs "without consent."
Insufficient traffic. Fewer than 1,000 weekly users? You can't reliably test small improvements. Use other methods.
Fundamental business model changes. You can't A/B test "subscription vs advertising." The differences are too large and affect too many systems.
Time-sensitive campaigns. Black Friday sales can't wait three weeks for test completion. Make your best decision and move on.
Use these instead:
User interviews to understand why users behave certain ways. Numbers tell you what happened. Users tell you why.
Fake door tests to validate demand for new features. Show a button for the feature. Count clicks. Build only if enough people click.
Gradual rollouts to reduce risk for major changes. Release to 1%, then 5%, then 10%. Watch for problems. This isn't A/B testing (it's progressive deployment) but it works.
Before/after analysis for unavoidable changes. Track metrics before the change and after. Not as rigorous as A/B testing, but sometimes it's your only option.
Your Pre-Launch Checklist
Complete these steps before launching any test:
Calculate sample size: Use a calculator. Write down the exact number.
Define success criteria: What improvement matters? What's your minimum threshold? Write this down before seeing results.
Check your traffic: Do you get enough daily users to reach your sample size in reasonable time?
Set test duration: Based on traffic and sample size, how many days minimum? Add a buffer for weekends or holidays.
Verify the split: On day one, check that traffic divides correctly between versions.
Plan your analysis: Pick a specific date and time when you'll check results. Put it on your calendar. Don't look before then.
Simple Test Plan Template
Document every test with this structure:
## Test: [Descriptive name]
What we're testing:
[Describe the change in one sentence]
Why we think it'll work:
[Your hypothesis]
Primary metric:
[What number should improve?]
Secondary metrics:
[What other numbers might change?]
Sample size needed:
[Calculated number per variant]
Test duration:
[Days required based on traffic]
Decision criteria:
- Ship if: [Define your thresholds]
- Keep testing if: [When to extend]
- Stop if: [When to abandon]
Launch date:
[When test starts]
Analysis date:
[When you'll check results]
This template forces you to think through the test before launching. It also creates documentation for future reference.
Getting Started Today
Pick one test you want to run. Use the template above. Calculate your sample size properly.
Don't run the test yet. Just plan it completely. Write down every parameter. Calculate the exact sample size. Set your success criteria.
Now look at what you wrote. Is the test actually feasible? Can you wait that long? Is the improvement big enough to matter?
If yes, launch it. If no, redesign the test or pick a different approach.
That's the discipline. Most bad tests come from skipping this planning step.
Five Things to Remember
One: Calculate sample size before launching. This is non-negotiable.
Two: Don't check results early. Wait for your full sample.
Three: Small improvements need large samples. Be realistic about what you can detect.
Four: Most tests fail. That's normal. Keep testing.
Five: Trust the process. Statistics work when you follow the rules.
Tools and Resources
Quick Reference: Terms That Matter
Confidence interval: The range where the true result probably sits.
Control: Your original version (Version A).
Minimum detectable effect: The smallest improvement you care about finding.
P-value: Probability of seeing a difference at least this large if nothing actually changed.
Power: Your test's ability to detect real improvements. Usually 80%.
Sample size: Number of users needed per version to reach reliable conclusions.
Statistical significance: Evidence the difference is real, not random. Usually 95% confidence.
Variant: Your new version being tested (Version B).
Start with these fundamentals. Master them completely before exploring advanced techniques.
Most testing failures come from skipping basics, not from lacking sophisticated methods.
Calculate your sample size. Wait for completion. Ship clear winners.
That's it.