One of the most common sources of false results in A/B testing is stopping your tests too early, or in the middle of a business cycle. If you do that, your results will most likely be off, sometimes by a small margin, sometimes by an order of magnitude. You will end up making decisions on wrong data, which is probably worse than making decisions with no data at all.
Yet it’s very easy to make sure you run your tests appropriately. There are only 3 rules to follow:
- do not stop your test until you have reached the minimum sample size that will make your test results statistically valid and
- do not stop your test before you have run it for at least a complete business cycle and
- run your test for full business cycles (do not stop it after one and a half business cycles, for example).
That’s it. Follow these guidelines and your results will be valid. Don’t, and they won’t. It’s that simple. I know some CRO practitioners try to simplify things by giving guidelines like 250 conversions, or 5,000 visits, or 3 weeks, etc., but these numbers are meaningless. I have clients that see 1 conversion a day and clients that see 30,000+ transactions daily. Try to apply such one-size-fits-all numbers to clients that different and you end up with absurd, invalid data.
Now let’s review each of these rules.
Minimum Sample Size
All A/B testing tools provide you with a single metric to gauge the statistical validity of your test results: statistical significance. But this metric is basically meaningless unless you combine it with the Minimum Sample Size, which is the minimum number of Unique Visitors that should be tested before you can declare your test results valid. Put simply, it is the amount of traffic you need to put through your test before stopping it, even if your testing tool declares a winner at 99% statistical significance earlier!
To understand why, consider that statistical significance is computed only on the difference between Conversion Rates, regardless of the number of visitors tested, and can be greatly biased with small numbers of visitors and conversions. It is the same statistical effect you get when playing heads or tails with a coin: after 10 flips, you could very well get 9 heads and 1 tail purely by chance. In that case, you would get an indication that your test is 99% statistically significant, even though with just 10 flips it doesn’t mean anything. The minimum sample size is the number of tries you need to complete before you can meaningfully test your results for statistical significance.
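The coin example can be checked with a few lines of Python (my own illustration, not part of any testing tool): an exact binomial calculation shows that 9 heads out of 10 flips of a perfectly fair coin already looks close to “99% significant”.

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more
    heads in n flips of a coin with heads-probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 9 or more heads out of 10 flips of a *fair* coin:
p_value = binom_tail(10, 9)        # 11/1024, about 0.011
significance = 1 - p_value
print(f"{significance:.1%}")       # → 98.9%
```

So a fair coin, with no real effect at all, reports ~99% “confidence” after only 10 tries. Only a large enough sample makes that number trustworthy.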
Now the question is: how do you know the minimum sample size you need to use? The answer is: it is unique to each test and it depends on the performance achieved by your winning variation over your control. Basically, the more your tested variation(s) outperform your control version, the smaller your minimum sample size will be. The easiest way to compute it is to use an online calculator. Each A/B testing tool has a basic one you can use (for example, here are Optimizely’s and Visual Website Optimizer’s).
I also have 2 more advanced online calculators on this website:
- A/B Testing Strategy Planning Calculator: use this calculator before starting a test so you can estimate how long a test will take to complete given a range of different performance levels. You can quickly see the estimated duration of your test if your winning variation outperforms your control by 1%, 2%, 50%, etc. Note that these estimates are for planning different scenarios. You won’t know for sure what performance your test will achieve until you’ve started it.
- A/B Testing Results Validation Calculator: use this calculator after you have started a test; it takes your actual test data and computes exactly how much longer your test needs to run to reach its Minimum Sample Size.
The exact calculations behind the Minimum Sample Size are explained in detail in this post: the maths behind the minimum sample size in A/B testing, but here are the main principles to know:
- the more traffic you have going through the pages tested, the quicker your test will complete of course.
- the more variations you run as part of a test, the longer you’ll have to wait for the test to complete. Ex: a test with only the control and 1 variation will complete quicker, all other things being equal, than a test with the control and 4 variations.
- the lower your current (before the test) Conversion Rate, the longer you’ll have to wait. Ex: if your conversion rate is 5%, then your tests will complete quicker than if your conversion rate is only 1%.
- the lower your best variation’s performance, the longer you’ll have to wait. Ex: if the best variation tested achieves a 50% increase in conversion rate, your test will complete much quicker than if its lift were only 10%.
- multivariate tests as a rule take much more time to complete, because they result in many more variations than most regular A/B tests.
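These principles can be made concrete with a minimal Python sketch of the standard two-proportion sample size formula at 95% significance and 80% power (the exact formula behind any given calculator may differ slightly, and the parameter names here are my own):

```python
from math import ceil

def min_sample_per_variation(base_cr, lift, z_alpha=1.96, z_beta=0.8416):
    """Visitors needed per variation to detect a relative `lift`
    over a baseline conversion rate `base_cr`, using the standard
    two-proportion formula (95% significance, 80% power)."""
    p1 = base_cr
    p2 = base_cr * (1 + lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Lower conversion rate or smaller lift -> much bigger sample needed:
print(min_sample_per_variation(0.05, 0.50))  # 5% CR, 50% lift → 1468
print(min_sample_per_variation(0.05, 0.10))  # 5% CR, 10% lift: far more
print(min_sample_per_variation(0.01, 0.10))  # 1% CR, 10% lift: more still
```

Multiply the result by the number of variations (control included) to see why tests with 4 variations, or multivariate tests, take so much longer to complete.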
Best practice: always plan your tests by calculating the Minimum Sample Size needed for a realistic test performance, and keep using the results validation calculator while your test is running to ensure you will reach that sample size in an actionable timeframe (a test that needs 6 months to validate is useless!).
Run tests for at least a complete business cycle
The second rule to follow is to run your test for at least a full business cycle, which is weekly in 95% of cases. Even if you reach your Minimum Sample Size in 3 days, you should not stop your test until it has run for 7 full days, or whatever duration your business cycle is.
That’s because you want your test results to reflect the full mix of visitor types, and that mix can vary wildly between early morning on a weekday and a Sunday afternoon. Your transactions, average order value and conversion rate can differ greatly from one day of the week to the next. This reflects differences in the motivations, timelines and overall behavior of your visitors, and you need to capture all of them in your test for it to be valid.
To clarify why this is important, let’s take the example of a highly trafficked ecommerce website that sees lower overall visits during the week but a spike during the week-end, and let’s further assume that the week-end spike is driven by mobile visits. You start a test on Monday on a new checkout flow, and since you have high traffic and one of your variations performs well, you reach your Minimum Sample Size after only 3 days. But you know you should test the full business cycle, so you continue testing until the following Monday. Alas, the test performance dips during the week-end and ends up negative over the full period: your new checkout flow performed better than the control on desktop browsers but much worse on mobile.
Had you stopped your test before the full business cycle, you would have ended up implementing a checkout flow that would have crashed your conversion rate during the highest part of your weekly sales cycle.
You can easily see your business cycle in any Google Analytics chart set to a 1-month period with daily data points. You should see the repetitive ups and downs quite clearly. You can get more detail on the magnitude of the differences by looking at your metrics by day of the week, hour of the day, or day of the month (if your business has a monthly cycle). You can use the Seasonality / Business Cycle report for CRO and A/B Testing custom report I shared in the GA Solutions Gallery to get this data easily.
Do not stop tests mid-cycle
The third and last principle is just an extension of the second: you will face the same issues if you stop tests in the middle of a cycle, even if you have already run them for a full cycle before. Let’s say your business cycle is 7 days and that by day 10 you have also reached the required minimum sample size. Then you should continue the test for an additional 4 days to stop it after 2 full business cycles, and not in the middle of one.
In practice: be disciplined but the right tools make it very easy
So there you have it: the 3 principles to follow to know for sure how long to run your tests. The most complex is the concept of Minimum Sample Size, but the online tools available to you make even this one extra simple to implement. The very first step you should take if you have a test running is to put your raw test data into my A/B Testing Results Validation Calculator: it will tell you exactly how many days you still have to wait to reach the minimum sample size required for your exact test. The other 2 principles are more a matter of well-implemented testing processes.
Photo credit: resplashed