AB split testing investigates the impact of changing (typically) one aspect of your site, to discover how much uplift such a change could have on your conversions or click-throughs. There’s one word in that previous sentence that has more importance than you might initially think, and that is “could”. There’s no guarantee that you will ever see that uplift.
To measure how likely a potential uplift is to be real, we can use a statistical technique called hypothesis testing. But as always with statistics, there is terminology that needs to be understood before you can properly interpret and act on the results.
Hypothesis testing
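This is a statistical technique for judging whether an observed difference between two samples of data could plausibly be down to chance. In an AB test, we are interested in whether our variation is better than the control; in other words, will the conversion rate be better for the variation than for the control? The test starts from the assumption that there is no difference (the null hypothesis) and asks whether the data give enough evidence to reject it. The most difficult concept to grasp here is that a hypothesis test never proves that a difference exists: it can only reject, or fail to reject, the assumption of no difference.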
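In symbols, the starting assumption of no difference and the alternative we hope to find evidence for could be written as below. The notation is introduced here purely for illustration: p_C stands for the true conversion rate of the control and p_V for that of the variation.

```latex
H_0:\; p_V = p_C \qquad \text{versus} \qquad H_1:\; p_V > p_C
```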
Significance
This is the threshold a result has to beat before we consider the difference significant, and it is typically set at 0.1, 0.05 or 0.01. The significance level is the risk you accept of calling a difference significant when there really is none. If you choose a smaller significance level, your result has to be more extreme before the test counts as significant.
Confidence
Confidence is more commonly associated with confidence intervals and applies only to the test you have run, but it is directly related to significance: the confidence level is one minus the significance level. If you want to be 90% confident, then you would set your significance at the 0.1 level. Intuitively, this makes sense: if you want to be more confident that you have a significant test, then you allow a smaller margin for extreme results that happen by chance. So the confidence level goes up as the significance level goes down.
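To make the link between the two terms concrete, here is a minimal sketch that turns each common significance level into its confidence level and into the cut-off a test statistic would have to exceed. It assumes a two-sided test under a normal approximation, and the function names are made up for this illustration.

```python
from statistics import NormalDist


def confidence_level(alpha: float) -> float:
    """Confidence is simply one minus the significance level."""
    return 1.0 - alpha


def critical_z(alpha: float, two_sided: bool = True) -> float:
    """Cut-off on the standard normal scale for a given significance level.

    Assumption: the test statistic is approximately normally distributed.
    """
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1.0 - tail)


for alpha in (0.1, 0.05, 0.01):
    print(
        f"significance={alpha:<5} "
        f"confidence={confidence_level(alpha):.0%} "
        f"critical z={critical_z(alpha):.3f}"
    )
```

The smaller the significance level, the larger the cut-off, which is the "smaller margin for extreme results" described above.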
P-values
As we said earlier, the significance level is a threshold, and it’s the p-value that gets measured against it. Statistically, the p-value is the probability of seeing a result at least as extreme as yours if the null hypothesis were true. In normal AB testing speak, this is the probability that chance alone would produce a difference as large as the one you found, assuming there is really no difference between the variant and the control. It is not the probability that there is no difference.
Each one of these plays a part in a basic AB test, from constructing your hypothesis to conducting your test and analysing the results. It’s important to understand that a hypothesis test will never tell you for certain whether there is a difference; it only tells you how strong the evidence is against there being no difference. And even then, you control whether a test counts as significant, through the significance level you choose.
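As a rough illustration of how a p-value might be calculated for an AB test, the sketch below uses a one-sided two-proportion z-test. That choice of test is an assumption made for this example, not the only option, and the function name and visitor counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def ab_test_p_value(conv_control, n_control, conv_variant, n_variant):
    """One-sided p-value for 'the variant converts better than the control'.

    Assumption: samples are large enough for the normal approximation.
    """
    rate_control = conv_control / n_control
    rate_variant = conv_variant / n_variant
    # Under the null hypothesis both groups share one conversion rate,
    # so the two samples are pooled to estimate it.
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = (rate_variant - rate_control) / std_error
    # Probability of a z at least this large if there were no difference.
    return 1.0 - NormalDist().cdf(z)


# Hypothetical data: 200/1000 conversions for control, 230/1000 for variant.
p_value = ab_test_p_value(200, 1000, 230, 1000)
for alpha in (0.1, 0.05, 0.01):
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"p={p_value:.4f} at significance {alpha}: {verdict}")
```

With these invented numbers the p-value lands just above 0.05, so the same result is significant at the 0.1 level but not at 0.05 or 0.01. The data did not change; only the threshold did.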
Summary:
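- A hypothesis test asks whether the data give enough evidence to reject the assumption of no difference
- Confidence and significance are not the same, but they are linked: confidence is one minus significance
- The p-value is the probability of getting a result at least as extreme as yours when there is no difference between variants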