Apptimize
Apr 8, 2014
So you want to test and experiment with your apps to optimize the user experience, aka do science on your apps. But, as it turns out, that's actually harder than it looks. Gathering data used to be the hard part, but with the improvement of analytics tools, it's getting easier and easier. Now, the difficulty lies in experimental design. Testing too much invites false positive results, so it's critical to be selective about what and how you test.
First of all, why is experimental design so easy to screw up? It has a lot to do with uncertainty. All tests have finite time and sample size, so you have to set your standards: a significance threshold (conventionally 5%) and a target statistical power (conventionally 80%).
As this helpful chart in the Economist explains, even if you use these scientifically accepted standards, you can still end up making a disconcerting number of mistakes. In fact, assuming only 1 out of 10 hypotheses is true — or if only 1 out of 10 tests has a truly winning variant (to put it in A/B terms) — you’ll still end up with almost half of your positive results being false!
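To make that claim concrete, here's a minimal sketch of the arithmetic, assuming the conventional 5% significance level and treating power as a free parameter (the exact fraction of false wins depends on the power your tests actually achieve):

```python
# Share of "significant" results that are false when only 1 in 10 hypotheses is true.
# Assumes a 5% significance level; power is how often a real effect gets detected.

def share_of_false_positives(prior_true=0.10, alpha=0.05, power=0.80):
    """Fraction of positive results that are actually false."""
    true_positives = prior_true * power          # real effects you detect
    false_positives = (1 - prior_true) * alpha   # null effects that fluke past alpha
    return false_positives / (true_positives + false_positives)

for power in (0.8, 0.5, 0.2):
    print(f"power={power:.0%}: {share_of_false_positives(power=power):.0%} of positives are false")
```

With 80% power, roughly a third of your "wins" are phantoms; with the lower power typical of short, under-sized tests, the share climbs past half.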
So, statistics tells us that most science is wrong.
Actually, the way most people do statistics is wrong too. If it tickles your fancy, you can read an intensely detailed synopsis of the many pitfalls of scientific statistics. Luckily for you, the Apptimize framework makes it easy to do tests right. But, there are still a few important pitfalls you should take care to avoid. One of the most tempting errors is inviting false positives through multiple hypothesis testing. That’s what leads us to the frightening scenario described above.
When you run a ton of tests, that’s called multiple hypothesis testing. The problem with this approach, as shown above, is that if you test all possible combinations of several variables, some of them will seem to show a significant correlation simply by chance. If you really do need to run many tests, you can correct for the false positives in a variety of ways, from the relatively lenient to the draconian. But beware — any method of multiple hypothesis correction decreases the power of each test, and you might end up burying valid results beneath the correction threshold. Typically, the best solution is, rather than to test and correct, not to open yourself up to that danger at all.
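If you do find yourself in that situation, here's a sketch of two common corrections, assuming you already have the p-values from your tests in hand; the Bonferroni adjustment sits at the draconian end and the Benjamini-Hochberg procedure at the relatively lenient end (the post doesn't prescribe a specific method, so treat these as illustrations):

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Draconian: a result counts only if p < alpha / (number of tests)."""
    p = np.asarray(p_values)
    return p < alpha / len(p)

def benjamini_hochberg(p_values, alpha=0.05):
    """More lenient: controls the expected share of false discoveries."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    ranked = p[order]
    thresholds = alpha * np.arange(1, len(p) + 1) / len(p)
    passing = ranked <= thresholds
    significant = np.zeros(len(p), dtype=bool)
    if passing.any():
        cutoff = np.nonzero(passing)[0].max()   # largest rank that still passes
        significant[order[:cutoff + 1]] = True
    return significant

p_values = [0.001, 0.008, 0.020, 0.041, 0.30]
print(bonferroni(p_values))           # only the strongest results survive
print(benjamini_hochberg(p_values))   # a few more make the cut
```

Either way, every result has to clear a higher bar than it would in a single, well-designed test, which is exactly the loss of power described above.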
Let's see an example of how multiple hypotheses can get out of hand. Say you collected user properties including age, gender, education level, height, and dietary restrictions, and you made a two-variant test for each. After combining enough filters, regardless of whether you have any actual causal connections in your underlying population, you're statistically bound to find something. Perhaps you'll find that 6'1″ male pescatarians with a bachelor's degree who are between the ages of 23 and 25 have a special propensity to click the red button instead of the blue one! In the modern age of apps, where we enjoy the luxury and suffer the curse of too much data, we are especially prone to being led astray by the ghosts of illusory significance.
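To see how easily those ghosts appear, here's a hypothetical simulation: both button colors convert at exactly the same rate, yet once you slice users into enough arbitrary segments and test each slice, a handful of them will typically look "significant" at the usual 5% level (the user counts and segment counts below are made up for the sketch):

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

n_users = 100_000
true_rate = 0.10          # red and blue convert at exactly the same rate
n_segments = 100          # age x height x diet ... style slices

# Assign each user a variant, a segment, and an outcome; no real effect exists.
variant = rng.integers(0, 2, n_users)      # 0 = blue button, 1 = red button
segment = rng.integers(0, n_segments, n_users)
converted = rng.random(n_users) < true_rate

false_wins = 0
for s in range(n_segments):
    in_seg = segment == s
    table = [
        [np.sum(converted[in_seg & (variant == 1)]), np.sum(~converted[in_seg & (variant == 1)])],
        [np.sum(converted[in_seg & (variant == 0)]), np.sum(~converted[in_seg & (variant == 0)])],
    ]
    _, p, _, _ = chi2_contingency(table, correction=False)
    if p < 0.05:
        false_wins += 1

print(f"{false_wins} of {n_segments} segments look 'significant' with no real difference")
```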
Fortunately, you can combat this curse by augmenting the machines’ intelligence with your own. Just take a manageable set of changes you think might help your metrics, test them, and you’ll run a much lower risk of falsely significant results. In other words, don’t run twenty tests targeted by three different parameters – run two or three larger tests, targeted by one or two. The same goes for variant design; two to four will serve better than dozens.
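The statistical payoff of that restraint is easy to see in a sketch: with a fixed amount of traffic, the chance of detecting a genuine lift drops quickly as you split users across more variants (the baseline rate, lift, and traffic figures below are made-up assumptions, and the power formula is a normal approximation):

```python
from math import sqrt
from scipy.stats import norm

def detection_power(n_per_variant, base_rate=0.10, lift=0.02, alpha=0.05):
    """Approximate power of a two-proportion z-test for base_rate vs base_rate + lift."""
    p_bar = base_rate + lift / 2
    se = sqrt(2 * p_bar * (1 - p_bar) / n_per_variant)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - lift / se)

total_traffic = 20_000
for n_variants in (2, 4, 10, 20):
    n = total_traffic // n_variants
    print(f"{n_variants} variants -> {n} users each, power ~ {detection_power(n):.0%}")
```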
This kind of moderation saves you time and effort, gives you statistical power, and improves the chances that the differences you find are true ones. Of course, through Apptimize, you can still discover the outliers: our results dashboard lets you see when a metric you weren't thinking about skyrockets or plunges. But in the default case, you'll likely find that sensible theories and smart guessing lead to good results.
Now, go forth and test, and remember: With great science comes great responsibility.
Thanks for reading!