This post explains in detail the parameters and maths behind the online calculators available on this website: the A/B testing strategy planning calculator and the A/B testing results validation calculator. It is an advanced topic that most of you can safely skip, but spelling out the exact maths and parameters matters for those who want to compare these calculators with other tools available online.

Common assumptions made in the equations below

All the equations below make two assumptions, which you should be comfortable with:

First, the tests are run at a 95% statistical significance level. What does this mean? Optimizely has a great knowledge base article on this. Key extract on statistical significance:

Statistical significance deals with the risk of false positives—in the situation when there is no difference between your original and your variation, it predicts how often your test results will accurately say that there is no difference, and how often your test results will incorrectly tell you there is a difference (a false positive). Usually tests are run at 95% statistical significance. This means that when there is no difference between your original and your variation, you will accurately measure no difference 95% of the time (19 out of 20 tests), and you will measure a false positive 5% of the time (1 out of 20 tests).

 

Bottom line: the higher your level of statistical significance, the less likely you are to conclude that your variation is a winner when it is not.

Second, the tests are 2-tailed, not 1-tailed as used, for example, in Optimizely’s online sample size calculator. Again, here is a great description of the difference, from the same Optimizely KB article:

When you run a test, you can run a 1-tailed or 2-tailed test. 2-tailed tests are designed to detect differences between your original and your variation in both directions—it will tell you if your variation is a winner and it will also tell you if your variation is a loser. 1-tailed tests are designed to detect differences between your original and your variation in only one direction.

 

At Optimizely, we use 1-tailed tests. We do this because we believe that 1) the ability to identify a “winner” is more valuable than the ability to identify a “loser” when choosing between an original and a variation and 2) this enables faster decision-making. The fastest way to help you identify your winners is to run a 1-tailed test. Because of this, the sample size calculator is set to 1-tailed tests by default.

 

However, you may want to run your experiment as a 2-tailed test, which will double the acceptable false positive rate. In our example, that means doubling 5% to 10%. Simply select the 2-tailed option in the calculator.

I disagree with Optimizely’s advice to use 1-tailed calculations; I would always prefer 2-tailed calculations.

Unless otherwise specified, the calculators available on Marktisans use a Power level of 80%, but I also give below the equations to reconstruct the data for a Power level of 95% (I personally use both in my client work, depending on the test). The same Optimizely KB article is again helpful to define Power level in a simple way:

Statistical power deals with the risk of false negatives—in the situation where there is a difference between your original and your variation, it predicts how often your test results will accurately say that there is a difference, and how often your test results will incorrectly tell you there is no difference (a false negative). Usually people run tests at 80% statistical power. This means that 80% of the time (8 out of 10 tests), a winning variation will be identified as a winning variation. In other words, 20% of the time (2 out of 10 tests), you will not detect a winning variation even though you have one.

 

Bottom line: the higher your level of statistical power, the less likely you are to miss out on identifying a winning variation.

Given these parameters, you will find below the equations used to compute the data available through the calculators. I have given equations to compute each variable, but of course it is the same equation, simply rearranged to express each variable in turn.

Variables used in the equations are:

  • Variations: the number of NEW variations tested, not including the Control version
  • CR: the current Conversion Rate of the page(s) tested, i.e. your Baseline Conversion Rate
  • Performance: the relative increase of the Conversion Rate that your winning variation is seeing. For example, if your Baseline CR is 5% and the CR of your winning variation is 5.5%, then the Variation Performance is 10%, since your variation increases your CR by 10%.
  • UVs: the minimum total sample size (Control plus all variations) to reach for your test to be statistically valid, measured in Unique Visitors tested.

Expressions for Power = 80%

    \[ \text{UVs} = 16 \times (\text{Variations} +1) \times \left(\frac{\sqrt{\text{CR} \times (1-\text{CR})}}{\text{CR} \times \text{Performance}}\right)^2 \]

    \[ \text{Performance} = 4 \times \left(\frac{- (\text{Variations} +1) \times (\text{CR} - 1)}{\text{CR} \times \text{UVs}}\right)^{\frac{1}{2}} \]

    \[ \text{Variations} = - \frac{\text{CR} \times \text{UVs} \times \text{Performance}^2 + 16(\text{CR} - 1)}{16(\text{CR} - 1)} \]

    \[ \text{CR} = \left(\frac{\text{UVs} \times \text{Performance}^2}{16(\text{Variations} + 1)} + 1\right)^{-1} \]
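
To make these expressions concrete, here is a minimal Python sketch that plugs numbers into the UVs equation above; the function name and example values are mine and not part of the calculators, and the inputs simply reuse the example from the variable definitions (Baseline CR of 5%, one new variation, an expected Performance of 10%).

    def min_sample_size_80(cr, performance, variations):
        """Minimum total Unique Visitors (Control plus all variations) at 80% Power
        and 95% two-tailed statistical significance, per the UVs equation above."""
        # (sqrt(CR*(1-CR)) / (CR*Performance))^2 simplifies to (1-CR) / (CR*Performance^2)
        return 16 * (variations + 1) * (1 - cr) / (cr * performance ** 2)

    # Baseline CR = 5%, 1 new variation, expected relative uplift (Performance) = 10%
    print(round(min_sample_size_80(cr=0.05, performance=0.10, variations=1)))  # 60800

Under these inputs, the equation asks for roughly 60,800 Unique Visitors in total before the test can be considered valid.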

Expressions for Power = 95%

    \[ \text{UVs} = 26 \times (\text{Variations} +1) \times \left(\frac{\sqrt{\text{CR} \times (1-\text{CR})}}{\text{CR} \times \text{Performance}}\right)^2 \]

    \[ \text{Performance} = 26^{\frac{1}{2}} \times \left(\frac{- (\text{Variations} +1) \times (\text{CR} - 1)}{\text{CR} \times \text{UVs}}\right)^{\frac{1}{2}} \]

    \[ \text{Variations} = - \frac{\text{CR} \times \text{UVs} \times \text{Performance}^2 + 26(\text{CR} - 1)}{26(\text{CR} - 1)} \]

    \[ \text{CR} = \left(\frac{\text{UVs} \times \text{Performance}^2}{26(\text{Variations} + 1)} + 1\right)^{-1} \]
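
If you are wondering where the constants 16 and 26 come from, they match the usual two-sided normal-approximation constant 2 × (z_{1-α/2} + z_{Power})² with α = 5%, as the comment thread below also points out. Here is a short check in Python; it assumes SciPy is available and is only an illustration, not how the calculators themselves are implemented.

    from scipy.stats import norm

    z_sig = norm.ppf(0.975)                        # two-sided 95% significance, ~1.96
    const_80 = 2 * (z_sig + norm.ppf(0.80)) ** 2   # ~15.7, rounded up to 16
    const_95 = 2 * (z_sig + norm.ppf(0.95)) ** 2   # ~26.0, used as 26

    print(round(const_80, 1), round(const_95, 1))  # 15.7 26.0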

 

Photo credit: Alex Graves on Flickr


About the Author:

Julien Le Nestour
Applied behavioral scientist & international consultant. I use the results and latest advances from the behavioral sciences, specifically behavioral economics, to help companies solve strategic issues. I work with both start-ups and Fortune 500 groups across industries, with specific domain knowledge in banking, asset management, B2B and consumer IT, SaaS, and e-commerce.

9 Comments

  1. Brian Craft 04/03/2015 at 2:31 am

    Hey Julien,

    Great post! Could you give some context around where the constants 16 and 26 come from in the UV calculations? Thanks.

    • Stanislav Sopov 10/01/2016 at 2:11 am

      2 * (t_0.975 + t_0.8)^2 ~ 16
      2 * (t_0.975 + t_0.95)^2 ~ 26

      t_x – quantile of the standard normal distribution at x (e.g. t_0.975 ≈ 1.96)

      • RITESH RITURAJ NAYAK 03/23/2017 at 5:25 pm

        Hi Stanislav,
        if I were to compute the required sample size to test a one-tailed hypothesis, I should replace 16 and 26 with 8 and 13 respectively, am I right?

  2. Georgi Georgiev 05/02/2015 at 12:50 am

    Hi Julien,

    Stumbled across your article while doing some research. Unfortunately I see you make a grave mistake that is, unfortunately, all too common: “First, the tests are run until 95% statistical significance is achieved.” This is a big no-no and a sure way to fail at AB testing. Having such a stopping rule is worse than not testing, because pretty much all the results you’ll get will be illusory.

    Details on the above in the article below, check the “Statistical Power Mistake #2” part for this particular issue:

    http://blog.analytics-toolkit.com/2014/why-every-internet-marketer-should-be-a-statistician/

    • Julien Le Nestour 05/02/2015 at 12:59 am

      Hi Georgi –

      Many thanks for your comment.

      You’re missing my point though, as this post is explaining the maths behind the calculators I offer freely on the site, which gives you a minimum sample size to achieve before stopping a test…

      So no, I don’t advocate stopping only when reaching 95% stat. sig. In fact, you can read exactly what I say on stopping here 🙂 https://julienlenestour.com/long-run-ab-test/ There are many more parameters to take into account, not all of them statistical!

      • Georgi Georgiev 05/02/2015 at 1:09 am

        Yes, I just ran into a comment of yours in another blog where you argued with the author about stopping rules 🙂 However, I feel that users will generally be fooled by your wording in your post above, so you might want to consider clarifying it 🙂

  3. Aknorr 09/10/2016 at 8:52 am

    Hey there Julien,

    I just checked your math, and there is a mistake in your Performance calculations. First, this cannot be the solved equation since Performance is on both sides of the equation!

    Could you please review and update? I am also curious to look in more detail at these formulas, do you have more sources for these?

    Thanks!

    • Paris 01/13/2017 at 10:51 pm

      Hi Aknorr,

      Having spent the last few hours demystifying the maths and applying it in my own experiment,
      I noticed that you are right; Julien has indeed made a small mistake, probably a typo.
      In the Performance calculation, instead of the Performance variable in the denominator it should be UVs.
      If you take the UVs equation and try to solve for Performance, you’ll see the difference.
      It would be great if Julien can confirm this whenever possible!

      Cheers!
