How Many Users Do I Need for My A/B Test? Calculate Sample Size and Run Times


Confusing Statistics

While the basic concept of planning an A/B test is pretty straightforward, two of the more complex questions that we often get are:

  • What percentage of my users should I put in my test?
  • How long will it take to decide which variant is best?

While it’s a perfectly fine strategy to deploy the test to all your users and wait for conclusive results to appear, understanding the basics behind A/B testing analytics can give you some benchmarks to work with.

The Basics

A/B testing analyzes the behavior of a random sample of your user population, and makes statistical predictions about the group as a whole. The larger the sample size, the more accurate the results.

If we wanted near certainty, we’d run every test with huge samples to increase confidence in the results. However, in most cases that’s just not practical. So how many users do you really need to get statistically significant results? In other words, what’s an adequate sample size for your test?

First of all, let’s define “statistically significant difference.” A difference between variants is statistically significant when the chance of at least that big a difference showing up randomly is less than 5% [1]. Say our test is changing the color of the “Buy” button. Our baseline might be a blue button (A), and our variant a green button (B). The metric we’re tracking is the fraction of users who click the button (the conversion rate). If B is significantly better than A at achieving clicks, a gap this large would show up by chance less than 5% of the time if the two buttons really performed the same. That lets us say, at the 95% confidence level, that B’s conversion rate is higher for ALL users, whatever sample size it took to get there.

The larger the difference between variants, the fewer samples you need before the results become statistically significant.
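To make the 5% threshold concrete, here is a minimal sketch of how a significance check could be computed for the button example. It assumes a pooled two-proportion z-test, which is one common choice and not necessarily the test behind any particular dashboard. The function name and the click counts are hypothetical.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    Uses a pooled two-proportion z-test (an assumed method, chosen for
    illustration; the article does not name a specific test).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / se
    # Probability that a standard normal variable is at least |z| away from 0
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: blue button (A) vs. green button (B)
p = two_proportion_p_value(conv_a=100, n_a=1000, conv_b=135, n_b=1000)
print(f"p-value = {p:.4f}, significant at 5%: {p < 0.05}")
```

With these made-up counts the p-value comes out below 0.05, so the difference would count as statistically significant under the definition above.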

Predicting how many users we need depends on a few factors, such as how big we think the difference will be between variants, how many variants there are, and what the conversion rates are. Here is a table of roughly how many users per variant (including baseline) we recommend using at the start of your test:

                                                Low conversion rates (~5%)   Medium conversion rates (~15%)   High conversion rates (~70%)
Small difference between variants (~10% lift)   40,000                       10,000                           700
Medium difference between variants (~20% lift)  8,000                        2,500                            150
Large difference between variants (~50% lift)   1,500                        500                              50

(The stats section below explains how we got these numbers.)

Calculating Runtimes

From there, you can calculate about how long you will need to run your test to see conclusive results:

Duration (weeks) = (Nv · Nu ) / (p · M/4 )

Nv = number of variants

Nu = number of users needed per variant (from the above table)

p = fraction of users in this test (e.g., if this test runs on 5% of users, p = 0.05)

M = MAU (monthly active users)

For instance, say you have two variants (baseline, plus one other), 100,000 MAUs, a current conversion rate of 10%, and an expected effect size of 20% (you expect the conversion rate of the new variant to be 12%). Then, if you run the test on 20% of your users, you will probably see conclusive results in about a week and a half.
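The duration formula is easy to turn into a small helper. The sketch below is illustrative only; the function and parameter names are made up. Because a 10% baseline sits between the table’s 5% and 15% columns, it plugs in both 20%-lift table values to bracket the answer.

```python
def duration_weeks(num_variants, users_per_variant, fraction_in_test, mau):
    """Duration (weeks) = (Nv * Nu) / (p * M / 4)."""
    # M / 4 treats a month as four weeks, giving users available per week
    weekly_users_in_test = fraction_in_test * mau / 4
    return (num_variants * users_per_variant) / weekly_users_in_test

# Two variants, 20% of 100,000 MAUs, table values for a ~20% lift
for users_per_variant in (2_500, 8_000):
    weeks = duration_weeks(2, users_per_variant, 0.20, 100_000)
    print(f"{users_per_variant} users per variant -> {weeks:.1f} weeks")
```

This prints 1.0 weeks for 2,500 users per variant and 3.2 weeks for 8,000, so the estimate of about a week and a half falls inside that range.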

Extra Credit: The Stats

Although the conversion rate for each of your variants appears as a single value, it would be more correct to think of it as a range. This range is the confidence interval, which is determined by the following equation:

CI = cr ± z · √( cr · (1 – cr) / n )

cr = conversion rate

z = a multiple determined by the cumulative normal distribution function (it grows as the confidence level you choose grows)

n = sample size

The multiple z is determined by how certain you want to be. If we want to be 95% certain that the real value of the conversion rate falls in the confidence interval, z is equal to 1.96.
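As a rough illustration of where 1.96 comes from, the sketch below derives z from the chosen confidence level using Python’s standard library and then applies the interval formula. The function names and the sample inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided z multiplier for a confidence level such as 0.95."""
    # 95% confidence leaves 2.5% in each tail, so look up the 97.5th percentile
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def confidence_interval(cr, n, confidence=0.95):
    """Return (low, high) for CI = cr ± z * sqrt(cr * (1 - cr) / n)."""
    z = z_for_confidence(confidence)
    margin = z * math.sqrt(cr * (1 - cr) / n)
    return cr - margin, cr + margin

print(round(z_for_confidence(0.95), 2))  # 1.96

# Hypothetical inputs: a 10% conversion rate measured on 1,000 users
low, high = confidence_interval(cr=0.10, n=1000)
print(f"{low:.3f} to {high:.3f}")
```

Asking for more certainty (say 99%) raises z and widens the interval, while a larger n narrows it.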

Let’s look at an example. If 4% of the 50 users who saw variant A converted on metric M, we know that if we showed A to all our app users, the real conversion rate would probably be somewhere around 4%, maybe slightly higher, maybe slightly lower. This range is the confidence interval, which we calculate to be 0 to 9.4%. (The formula gives a slightly negative lower bound, which we round up to 0 because a conversion rate can’t be negative.) If 6% of the 50 users who saw variant B converted on metric M, we know with 95% certainty that B’s conversion rate is somewhere between 0 and 12.6%. The two conversion rate ranges overlap, so the test is inconclusive and we need more users. We can get them either by expanding the test to a higher percentage of our total monthly active users (MAU) or by running the test for a longer period of time.
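The example can be checked with a few lines of code. The sketch below recomputes both intervals, tests whether they overlap, and then estimates how many users per variant it would take for the intervals to separate. Treating "intervals no longer overlap" as the stopping rule is an assumption made here for illustration, and all function names are hypothetical.

```python
import math

Z_95 = 1.96  # multiplier for 95% confidence

def confidence_interval(cr, n, z=Z_95):
    margin = z * math.sqrt(cr * (1 - cr) / n)
    # A conversion rate cannot be negative or above 100%, so clamp the range
    return max(0.0, cr - margin), min(1.0, cr + margin)

def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def users_needed(cr_a, cr_b, z=Z_95):
    """Approximate users per variant before the two intervals separate.

    Assumes non-overlapping intervals as the stopping rule and that the
    observed rates stay the same as the sample grows.
    """
    spread = math.sqrt(cr_a * (1 - cr_a)) + math.sqrt(cr_b * (1 - cr_b))
    return math.ceil((z * spread / abs(cr_b - cr_a)) ** 2)

ci_a = confidence_interval(0.04, 50)
ci_b = confidence_interval(0.06, 50)
print(f"A: {ci_a[0]:.1%} to {ci_a[1]:.1%}")  # A: 0.0% to 9.4%
print(f"B: {ci_b[0]:.1%} to {ci_b[1]:.1%}")  # B: 0.0% to 12.6%
print(intervals_overlap(ci_a, ci_b))         # True
print(users_needed(0.04, 0.06))
```

Under the same assumption, `users_needed(0.05, 0.06)` (a 5% baseline with a 20% lift) lands just under 8,000, which is close to the corresponding entry in the table above. That suggests, but does not confirm, that the table was built in a similar way.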

About Apptimize

Apptimize is an innovation engine that provides A/B testing and feature release management for native mobile, web, mobile web, hybrid mobile, OTT, and server. Industry leaders like HotelTonight, The Wall Street Journal, and Glassdoor have created amazing user experiences with Apptimize.
