Understanding Statistical Significance: When to Call a Winner
You've launched an A/B test, and after three days, Variation B is outperforming the original by 20%. Time to call a winner and ship it, right? Not so fast. That 20% lift might be real, or it might be a statistical fluke that disappears with more data. Understanding statistical significance is the difference between making decisions based on evidence and making decisions based on noise.
What Statistical Significance Actually Means
Here's the simplest way to think about it: imagine you flip a coin 10 times and get 7 heads. Does that prove the coin is biased? Probably not — getting at least 7 heads in 10 flips is unusual but not that unusual. It happens about 17% of the time with a perfectly fair coin. But if you flip the coin 1,000 times and get 700 heads, you can be extremely confident the coin is biased. The ratio is the same (70% heads), but the larger sample size gives you much more confidence in the conclusion.
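You can check those numbers yourself with a few lines of Python (exact binomial probabilities, standard library only):

```python
# Probability of seeing at least `heads` heads in `flips` flips of a fair coin
from math import comb

def prob_at_least(heads, flips, p=0.5):
    return sum(comb(flips, k) * p**k * (1 - p)**(flips - k)
               for k in range(heads, flips + 1))

print(prob_at_least(7, 10))      # ~0.17: unremarkable for a fair coin
print(prob_at_least(700, 1000))  # ~4e-38: essentially impossible
```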
Statistical significance works the same way in A/B testing. When we say a result is "statistically significant at 95% confidence," we mean that if there were no real difference between the variations, random chance alone would produce a difference this large less than 5% of the time. The larger your sample size and the bigger the difference between variations, the stronger the evidence that the winner is real.
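Most testing tools compute this with some version of a two-proportion z-test. Here's a minimal sketch; the function name and the traffic figures are made up for illustration:

```python
# Two-sided two-proportion z-test: p-value for the observed difference
# between two variations. p < 0.05 corresponds to 95% confidence.
from statistics import NormalDist

def two_proportion_p_value(conv_a, visitors_a, conv_b, visitors_b):
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = (p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical test: 3.0% vs 3.6% conversion over 10,000 visitors each
print(two_proportion_p_value(300, 10_000, 360, 10_000))  # ~0.018 -> significant
```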
Why 95% Is the Standard (And When 90% Is Acceptable)
The 95% confidence threshold is a convention borrowed from scientific research. It means you're willing to accept a 5% chance of a false positive — declaring a winner when there isn't a real difference. For most business decisions, this is a reasonable trade-off between confidence and speed.
However, there are situations where 90% confidence is acceptable. If you're testing a minor copy change on a low-traffic page, waiting for 95% confidence might take months. In these cases, 90% confidence gives you a reasonable level of certainty while allowing you to move faster. The key is to be explicit about the confidence level you're using and understand the trade-off: a 10% chance of a false positive instead of 5%.
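In practical terms, the confidence level just sets the bar the test statistic has to clear. Continuing the z-test sketch from above:

```python
from statistics import NormalDist

# Critical |z| threshold for a two-sided test at each confidence level
for conf in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    print(f"{conf:.0%} confidence -> |z| > {z:.2f}")
# 90% -> 1.64, 95% -> 1.96, 99% -> 2.58
```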
The Sample Size Problem
One of the most common surprises in A/B testing is how much traffic you actually need. If your baseline conversion rate is 3% and you want to detect a 10% relative improvement (from 3.0% to 3.3%), you need roughly 53,000 visitors per variation to reach 95% confidence at the standard 80% statistical power. That's over 100,000 total visitors for a two-variation test. If your page gets 500 visitors per day, that test needs to run for about 210 days, roughly seven months.
This is why experienced testers focus on high-traffic pages and test for larger differences. Testing a completely different headline approach (which might produce a 30-50% lift) requires far fewer visitors than testing a minor word change (which might produce a 5-10% lift). Prioritize bold tests on high-traffic pages to get actionable results faster.
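Both points fall out of the standard sample-size formula for comparing two proportions. Here's a sketch, assuming a two-sided test at 95% confidence and 80% power (`sample_size_per_variation` is a hypothetical helper, not a Copysplit API):

```python
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_lift,
                              alpha=0.05, power=0.80):
    """Visitors needed per variation to detect `relative_lift` over `baseline`."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 at 80% power
    p_bar = (p1 + p2) / 2
    num = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5)
    return (num / (p2 - p1)) ** 2

print(round(sample_size_per_variation(0.03, 0.10)))  # ~53,000 (10% lift)
print(round(sample_size_per_variation(0.03, 0.40)))  # ~3,800 (40% lift)
```

The 40% lift needs roughly one-fourteenth the traffic of the 10% lift, which is why bold tests pay off even on pages with modest traffic.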
Common Mistakes That Invalidate Your Tests
- Calling winners too early: This is by far the most common mistake. After a few hundred visitors, random variation can make one option look dramatically better. Resist the urge to peek and decide early (the simulation after this list shows how badly peeking inflates false positives). Set your required sample size before starting the test and commit to waiting.
- Running tests during unusual traffic periods: Launching a test on Black Friday, during a product launch, or during a seasonal spike introduces confounding variables. Your test results might reflect the unusual traffic composition rather than a genuine copy difference. Run tests during normal traffic periods whenever possible.
- Testing too many variations at once: Each additional variation increases the traffic you need to reach significance. A test with 5 variations needs roughly 2.5 times more traffic than a test with 2 variations. Stick to 2-3 variations unless you have very high traffic.
- Changing your test mid-flight: If you modify a variation, add a new one, or change the traffic split during a test, you invalidate the results. Treat each test as a sealed experiment — set it up, let it run, and don't touch it until it hits the sample size you planned.
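To see why peeking is so harmful, here's a rough simulation (hypothetical traffic figures): it runs A/A tests, where both variations are identical, and peeks for significance after every 1,000 visitors per arm. There is never a real difference, yet repeated peeking declares a "winner" far more often than the nominal 5%:

```python
import numpy as np
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # 1.96 for 95% confidence

def significant(x_a, x_b, n):
    """Two-sided two-proportion z-test with equal sample sizes."""
    p = (x_a + x_b) / (2 * n)
    se = np.sqrt(p * (1 - p) * 2 / n)
    return se > 0 and abs(x_a - x_b) / n / se > Z_CRIT

rng = np.random.default_rng(42)
trials, peeks, step, rate = 2000, 20, 1000, 0.03
false_positives = 0
for _ in range(trials):
    x_a = x_b = n = 0
    for _ in range(peeks):                 # peek after every 1,000 visitors
        x_a += rng.binomial(step, rate)    # both arms share the same 3% rate
        x_b += rng.binomial(step, rate)
        n += step
        if significant(x_a, x_b, n):
            false_positives += 1
            break

print(f"'Winner' declared in {false_positives / trials:.0%} of A/A tests")
# Expect roughly 20-30% false positives instead of the nominal 5%.
```

Methods exist that correct for continuous monitoring, but for a classic fixed-horizon test the safe rule is the one above: set your sample size in advance and wait.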
How Copysplit Handles Significance for You
You shouldn't need a statistics degree to run copy tests. Copysplit continuously monitors your tests and calculates statistical significance in real time. When a test reaches 95% confidence, you'll get a notification with a clear recommendation: which variation won, by how much, and what the estimated revenue impact is. If a test is unlikely to reach significance with your current traffic levels, Copysplit will tell you that too, so you don't waste time waiting for results that aren't coming.
The goal is to make data-driven copy decisions as fast as possible — without sacrificing the statistical rigor that makes those decisions trustworthy. Run more tests, get results faster, and have confidence that your winners are real winners.
Ready to test your copy?
Stop guessing which headlines, CTAs, and page copy will convert. Start testing with Copysplit today.
Get Started Free