Apptimize
Jan 28, 2014
While the basic concept of planning an A/B test is pretty straightforward, two of the more complex questions we often get are: how many users do I need, and how long do I need to run my test?
While it’s a perfectly fine strategy to deploy the test to all your users and wait for conclusive results to appear, understanding the basics behind A/B testing analytics can give you some benchmarks to work with.
A/B testing analyzes the behavior of a random sample of your user population, and makes statistical predictions about the group as a whole. The larger the sample size, the more accurate the results.
If we wanted near certainty, we’d run every test with huge samples to increase confidence in the results. However, in most cases that’s just not practical. So how many users do you really need to get statistically significant results? In other words, what’s an adequate sample size for my test?
First of all, let’s define “statistically significant difference.” A difference between variants is statistically significant when the chance of at least that big a difference showing up randomly is less than 5% [1]. Say our test is changing the color of the “Buy” button. Our baseline might be a blue button (A), and our variant a green button (B). The metric we’re tracking is the fraction of users who click the button (the conversion rate). If B is significantly better than A at achieving clicks, it means we can say with 95% certainty that B’s conversion rate across ALL users is higher, no matter the sample size.
The larger the difference between variants, the fewer users you need before you can be confident the results are statistically significant.
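To make the “Buy” button example concrete, here is a minimal sketch of a two-proportion z-test in Python. The post doesn’t say which significance test is used behind the scenes, and the click counts below are made up purely for illustration.

```python
# Hypothetical two-proportion z-test for the blue vs. green "Buy" button example.
# The counts are illustrative; this is one common way to test significance,
# not necessarily the exact method the post has in mind.
from math import sqrt, erf

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)              # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))    # standard error under H0
    z = (p_b - p_a) / se
    return 2 * (1 - normal_cdf(abs(z)))

# Example: 1,000 users per variant, blue converts at 10%, green at 13%.
p = two_proportion_p_value(clicks_a=100, n_a=1000, clicks_b=130, n_b=1000)
print(f"p-value = {p:.3f} -> significant at 5%: {p < 0.05}")
```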
Predicting how many users we need depends on a few factors, such as how big we think the difference will be between variants, how many variants there are, and what the conversion rates are. Here is a table of roughly how many users per variant (including baseline) we recommend using at the start of your test:
|  | Low conversion rates (~5%) | Medium conversion rates (~15%) | High conversion rates (~70%) |
| --- | --- | --- | --- |
| Small difference between variants (~10% lift) | 40,000 | 10,000 | 700 |
| Medium difference between variants (~20% lift) | 8,000 | 2,500 | 150 |
| Large difference between variants (~50% lift) | 1,500 | 500 | 50 |
(The stats section below explains how we got these numbers.)
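For reference, figures in the same ballpark as the table’s fall out of the standard two-proportion sample-size formula at 95% confidence and 80% power. The sketch below uses those conventional constants as an assumption; it isn’t necessarily the exact calculation behind the table.

```python
# Rough sample-size sketch: users per variant needed to detect a relative lift
# in conversion rate, assuming 95% confidence and 80% power (assumed constants).
from math import sqrt, ceil

Z_ALPHA = 1.96  # two-sided 95% confidence
Z_BETA = 0.84   # 80% power

def users_per_variant(baseline_cr, lift):
    """Approximate users needed per variant to detect a relative lift."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + lift)
    effect = p2 - p1
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((Z_ALPHA + Z_BETA) ** 2 * variance / effect ** 2)

# e.g. medium difference (~20% lift) at a low conversion rate (~5%)
print(users_per_variant(0.05, 0.20))  # roughly 8,000, in line with the table
```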
From there, you can calculate about how long you will need to run your test to see conclusive results:
Duration (weeks) = (Nv · Nu) / (p · M / 4)
Nv = number of variants
Nu = number of users needed per variant (from the above table)
p = fraction of users in this test (e.g., if this test runs on 5% of users, p = 0.05)
M = MAU (monthly active users)
For instance, say you have two variants (baseline, plus one other), 100,000 MAUs, a current conversion rate of 10%, and an expected effect size of 20% (you expect the conversion rate of the new variant to be 12%). Then, if you run the test on 20% of your users, you will probably see conclusive results in about a week and a half.
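Here is that example run through the duration formula. The 3,750 users per variant is our own rough interpolation from the table (a 10% conversion rate with a 20% lift sits between the “low” and “medium” conversion columns), not a number given in the post.

```python
# Worked example of the duration formula: two variants, 100,000 MAU,
# 20% of users in the test, ~3,750 users needed per variant (assumed/interpolated).
def test_duration_weeks(num_variants, users_per_variant, fraction_in_test, mau):
    """Duration (weeks) = (Nv * Nu) / (p * M / 4)."""
    return (num_variants * users_per_variant) / (fraction_in_test * mau / 4)

weeks = test_duration_weeks(num_variants=2, users_per_variant=3750,
                            fraction_in_test=0.20, mau=100_000)
print(f"{weeks:.1f} weeks")  # about a week and a half
```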
Although the conversion rate for each of your variants appears as a single value, it’s more accurate to think of it as a range. This range is the confidence interval, which is determined by the following equation:
CI = cr ± z · √( cr · (1 – cr) / n )
cr = conversion rate
z = a multiple determined by the cumulative normal distribution function
n = sample size
The multiple z is determined by how certain you want to be. If we want to be 95% certain that the real value of the conversion rate falls in the confidence interval, z is equal to 1.96.
Let’s look at an example. If 4% of the 50 users who saw variant A clicked the button (metric M), we know that if we showed A to all our app users, the real conversion rate would probably be somewhere around 4%, maybe slightly higher, maybe slightly lower. This range is the confidence interval, which we calculate to be 0 to 9.4%. If 6% of the 50 users who saw variant B clicked the button, we know with 95% certainty that B’s conversion rate is somewhere between 0 and 12.6%. The two conversion rate ranges actually overlap, so the test is inconclusive and we need to test more users. We can do that either by expanding the test to a higher percentage of our total monthly active users (MAU) or by running the test for a longer period of time.
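The same arithmetic, written out as a short script. It reproduces the 0 to 9.4% and 0 to 12.6% intervals above and checks whether they overlap; the lower bound is clipped at zero since a conversion rate can’t be negative.

```python
# Reproducing the confidence-interval example: 4% of 50 users on variant A,
# 6% of 50 users on variant B, at 95% confidence (z = 1.96).
from math import sqrt

Z_95 = 1.96

def confidence_interval(cr, n):
    """95% confidence interval for a conversion rate cr measured on n users."""
    margin = Z_95 * sqrt(cr * (1 - cr) / n)
    return max(0.0, cr - margin), cr + margin  # a rate can't go below 0

a_low, a_high = confidence_interval(0.04, 50)  # about 0% to 9.4%
b_low, b_high = confidence_interval(0.06, 50)  # about 0% to 12.6%
overlap = a_high >= b_low and b_high >= a_low
print(f"A: {a_low:.1%}-{a_high:.1%}, B: {b_low:.1%}-{b_high:.1%}, overlap: {overlap}")
```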
Thanks for reading!