The maths behind the minimum sample size in A/B testing

[fusion_builder_container hundred_percent=”yes” overflow=”visible”][fusion_builder_row][fusion_builder_column type=”1_1″ background_position=”left top” background_color=”” border_size=”” border_color=”” border_style=”solid” spacing=”yes” background_image=”” background_repeat=”no-repeat” padding=”” margin_top=”0px” margin_bottom=”0px” class=”” id=”” animation_type=”” animation_speed=”0.3″ animation_direction=”left” hide_on_mobile=”no” center_content=”no” min_height=”none”][latexpage]

This post explains in detail the parameters and maths behind the online calculators available on this website: the A/B testing strategy planning calculator and the A/B testing results validation calculator. This is an advanced topic that most of you should probably skip, but it is important to explain exactly the maths and parameters behind the calculators for those who would like to compare them with other tools available online.

Common assumptions made in the equations below

All the equations below make 2 assumptions, which you should be comfortable with:

First, the tests are run until 95% statistical significance is achieved. What does this mean? Optimizely has a great knowledge base article on this. Key extract on statistical significance:

Statistical significance deals with the risk of false positives—in the situation when there is no difference between your original and your variation, it predicts how often your test results will accurately say that there is no difference, and how often your test results will incorrectly tell you there is a difference (a false positive). Usually tests are run at 95% statistical significance. This means that when there is no difference between your original and your variation, you will accurately measure no difference 95% of the time (19 out of 20 tests), and you will measure a false positive 5% of the time (1 out of 20 tests).

Bottom line: the higher your level of statistical significance, the less likely you are to conclude that your variation is a winner when it is not.

Second, the tests are 2-tailed and not 1-tailed as used, for example, in Optimizely’s online sample size calculator. Again, here is a great description of the difference, from the same Optimizely KB article.

When you run a test, you can run a 1-tailed or 2-tailed test. 2-tailed tests are designed to detect differences between your original and your variation in both directions—it will tell you if your variation is a winner and it will also tell you if your variation is a loser. 1-tailed tests are designed to detect differences between your original and your variation in only one direction.

At Optimizely, we use 1-tailed tests. We do this because we believe that 1) the ability to identify a “winner” is more valuable than the ability to identify a “loser” when choosing between an original and a variation and 2) this enables faster decision-making. The fastest way to help you identify your winners is to run a 1-tailed test. Because of this, the sample size calculator is set to 1-tailed tests by default.

However, you may want to run your experiment as a 2-tailed test, which will double the acceptable false positive rate. In our example, that means doubling 5% to 10%. Simply select the 2-tailed option in the calculator.

I disagree with Optimizely’s advice to use 1-tailed calculations, I would always prefer to use 2-tailed calculations.

Except otherwise specified, the calculators available on Marktisans use a Power level of 80%, but I give below the equations to reconstructs the data for a Power level of 95% as well (I personally use both in my client work, depending on each test). The same Optimizely KB article is again helpful to define Power level in a simple way:

Statistical power deals with the risk of false negatives—in the situation where there is a difference between your original and your variation, it predicts how often your test results will accurately say that there is a difference, and how often your test results will incorrectly tell you there is no difference (a false negative). Usually people run tests at 80% statistical power. This means that 80% of the time (8 out of 10 tests), a winning variation will be identified as a winning variation. In other words, 20% of the time (2 out of 10 tests), you will not detect a winning variation even though you have one.

Bottom line: the higher your level of statistical power, the less likely you are to miss out on identifying a winning variation.

Given these parameters, you will find below the equations used to compute the data available through the calculators. I have given equations to compute each variable, but of course this is the same equation, simply rearranged to express each avriable in turn.

Variables used in the equations are:

Variations: the number of NEW variations tested, not including the Control version
CR: the current Conversion Rate of the page(s) tested, i.e. your Baseline Conversion Rate
Performance: the relative increase of the Conversion Rate that your winning variation is seeing. For example, if your Baseline CR is 5% and the CR of your winning variation is 5.5%, then the Variation Performance is 10%, since your variation increases your CR by 10%.
UVs: the minimum sample size to reach for your test to be statistically valid, measured in Unique Visitors tested.

Expressions for Power = 80%

\[
\text{UVs} = 16 \times (\text{Variations} +1) \times \left(\frac{\sqrt{\text{CR} \times (1-\text{CR})}}{\text{CR} \times \text{Performance}}\right)^2
\]

\[
\text{Performance} = 4 \times \left(\frac{- (\text{Variations} +1) \times (\text{CR} – 1)}{\text{CR} \times \text{Performance}}\right)^{\frac{1}{2}}
\]

\[
\text{Variations} = – \frac{\text{CR} \times \text{UVs} \times \text{Performance}^2 + 16(\text{CR} – 1)}{16(\text{CR} – 1)}
\]

\[
\text{CR} = \left(\frac{\text{UVs} \times \text{Performance}^2}{16(\text{Variations} + 1)} + 1\right)^{-1}
\]

Expressions for Power = 95%

\[
\text{UVs} = 26 \times (\text{Variations} +1) \times \left(\frac{\sqrt{\text{CR} \times (1-\text{CR})}}{\text{CR} \times \text{Performance}}\right)^2
\]

\[
\text{Performance} = 26^{\frac{1}{2}} \times \left(\frac{- (\text{Variations} +1) \times (\text{CR} – 1)}{\text{CR} \times \text{Performance}}\right)^{\frac{1}{2}}
\]

\[
\text{Variations} = – \frac{\text{CR} \times \text{UVs} \times \text{Performance}^2 + 26(\text{CR} – 1)}{26(\text{CR} – 1)}
\]

\[
\text{CR} = \left(\frac{\text{UVs} \times \text{Performance}^2}{26(\text{Variations} + 1)} + 1\right)^{-1}
\]

Photo credit: Alex Graves on Flickr[/fusion_builder_column][/fusion_builder_row][/fusion_builder_container]

Share the Post:

Responsive & Adaptive design choices explained as a vegetarian restaurant experience

Mobile traffic to websites has been rapidly increasing for the past few years and it’s not rare these days to

A 5-mins morning workout anyone can do that gets long-term results

Fair warning: personal post below! I’ve always been very active physically thanks to my father who is a sports teacher.