A/B testing (also known as split testing) is the process of comparing two or more versions of something to see which one performs better among users. The concept originates from the scientific method, where scientists formulate a hypothesis and create an experiment to collect empirical data that supports or refutes the hypothesis.
In the business world, marketers, product managers, and developers use the scientific method to create better versions of their product by launching more than one version to live users in real time and then collecting data to see which version actually improves their metrics. For example, 50% of your users might see version A and 50% might see version B for two weeks; whichever version drives more sales wins and becomes the permanent experience.
This webpage will specifically focus on one type of A/B testing: that of optimizing digital user experiences. Here are some examples of what we will focus on:
A/B testing takes the guesswork out of any optimization efforts and allows individuals, teams, and organizations to make data-driven decisions from user behaviors, shifting business focus from “we think” to “we know.” By comparing the results from an A/B test, you can check whether each change actually produces a measurable improvement before you commit to it. When used consistently, A/B testing can boost the overall user experience, leading to greater engagement, retention, and revenue.
Every A/B test begins with a hypothesis. For example, you might observe that customers who have registered for your loyalty program are more likely to buy products from your brand, but only 5% of your customers are currently in the program. You hypothesize that the reason for this low enrollment rate is that the page promoting the program is difficult to find on your website and mobile app. If you make it more prominent so that every customer who has bought something is made aware of the benefits of joining the loyalty program, more customers will join, receive promotions, and buy more often.
This is a perfect case for A/B testing. You can now break down your hypothesis into the following sections:
This step is where most of the work is for your development team. With your hypothesis in hand, you now need to design and develop each variant. You’ll also need to track the KPIs you want to measure if you haven’t done so already.
There are some scenarios where you can build your new variants using a web or mobile visual editor without needing engineers. Here are some examples where that can work.
Once the variants have been programmed and the KPIs are tracked, the next step is to randomly assign users to one of the variants. It’s important that users are unaware of their participation in the experiment and that they don’t know other variations of the same test element exist. Each time they enter the website or app, they should be shown the same version consistently over the duration of the experiment.
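One common way to get random but consistent assignment is to hash the user ID instead of rolling the dice on every visit. The sketch below is only an illustration, not a prescribed implementation: the function name `assign_variant`, the experiment name `loyalty-banner`, and the user ID format are all hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Map a user to a variant; the same inputs always give the same answer."""
    # Salt the hash with the experiment name so that different experiments
    # split the user base independently of each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 8 hex digits as an integer and pick a bucket.
    bucket = int(digest[:8], 16) % len(variants)
    return variants[bucket]


# The same (hypothetical) user gets the same variant on every visit.
first_visit = assign_variant("user-123", "loyalty-banner")
second_visit = assign_variant("user-123", "loyalty-banner")
```

Because the assignment depends only on the user ID and the experiment name, you don’t need a lookup table to remember who saw which version.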
Depending on the number of users you have and how dramatically different the KPIs end up being between the variants, you’ll likely run the A/B test for a few days to a few weeks. Here’s a simple calculator to help you estimate how long you need to run a test for.
Best practice: Run the A/B test for at least a week so that you capture the differences in weekday and weekend traffic.
You don’t want to stop an A/B test the moment you see it reach statistical significance. With very few data points, you might see false positives at the start that can lead you to make the wrong call (Airbnb’s engineering team has a great post on Medium that explains why this can happen in detail). To be statistically rigorous, it’s important to estimate how long you think an A/B test should run for and then make sure you let it run for that whole time.
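If you want to see the effect of stopping early for yourself, a small simulation can help. The sketch below runs A/A tests, where both variants are identical and any "significant" result is therefore a false positive, and compares checking at every interim look against checking once at the planned end. The 10% conversion rate, the sample sizes, and the function names are assumptions chosen purely for illustration.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using the pooled conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_aa(rng, users_per_arm=2000, look_every=100, rate=0.10, z_crit=1.96):
    """Run one A/A test; return (flagged_at_any_look, flagged_at_end)."""
    conv_a = conv_b = 0
    flagged_at_any_look = False
    for n in range(1, users_per_arm + 1):
        # Both arms share the same true conversion rate.
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        # "Peek" at the results every `look_every` users per arm.
        if n % look_every == 0 and abs(z_score(conv_a, n, conv_b, n)) > z_crit:
            flagged_at_any_look = True
    flagged_at_end = abs(z_score(conv_a, users_per_arm, conv_b, users_per_arm)) > z_crit
    return flagged_at_any_look, flagged_at_end


rng = random.Random(42)  # fixed seed so the run is repeatable
results = [simulate_aa(rng) for _ in range(500)]
peeking_rate = sum(early for early, _ in results) / len(results)
fixed_rate = sum(end for _, end in results) / len(results)
print(f"false positives when stopping at the first significant look: {peeking_rate:.1%}")
print(f"false positives when checking only at the planned end: {fixed_rate:.1%}")
```

Since there is no real difference between the variants, every flagged test is a mistake, and the more often you look, the more chances you give random noise to cross the significance threshold.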
Estimating how long you need to run an A/B test depends on a few things and can get a little complicated, so we wrote a whole blog post specifically about that.
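As a rough illustration of what goes into that estimate, here is a minimal sketch using the standard normal-approximation formula for comparing two conversion rates. The 5% significance level, 80% power, traffic figure, and function names are assumptions for the example rather than the method behind any particular calculator.

```python
import math
from statistics import NormalDist


def users_per_variant(baseline, expected, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect baseline -> expected."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + expected * (1 - expected)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (expected - baseline) ** 2)


def days_to_run(baseline, expected, daily_users, variants=2, min_days=7):
    """Days needed at a given daily traffic, never less than a full week."""
    needed = users_per_variant(baseline, expected) * variants
    return max(min_days, math.ceil(needed / daily_users))


# Hypothetical goal: lift loyalty sign-ups from 5% to 6% with 2,000 users a day.
needed = users_per_variant(baseline=0.05, expected=0.06)
days = days_to_run(baseline=0.05, expected=0.06, daily_users=2_000)
print(f"{needed} users per variant, about {days} days")
```

Note how quickly the requirement grows as the expected difference shrinks: the difference between the two rates is squared in the denominator, so halving the effect you want to detect roughly quadruples the users you need.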
Once you’ve let the A/B test run for the predicted duration, you can stop. If you find that one of the variants is statistically significantly better than the rest, you can deploy that version to all your users.
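One common way to check whether a difference in conversion rates is statistically significant is a two-proportion z-test. The sketch below is a simplified illustration: the counts are made up, the function name is hypothetical, and a real analysis may call for a different test depending on your metric.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift of B over A, two-sided p-value)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pool the data to estimate the rate under "no real difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, p_value


# Made-up results: 500 of 10,000 users converted on A, 600 of 10,000 on B.
lift, p_value = two_proportion_test(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
significant = p_value < 0.05  # 95% significance threshold
print(f"lift: {lift:.2%}, p-value: {p_value:.4f}, significant: {significant}")
```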
Depending on how risky or important this change is, you might want to create a holdout group and measure the effects of the change for a long time. In the loyalty program example, you might find that variant B increases participation in the program, but you can’t easily track how it changes purchase behavior in a few weeks or months. You might create a holdout group like Pinterest did in this example to really understand if seeing variant B actually makes customers spend more.
On the other hand, you might not see a statistically significant change in the predicted time period. In that case, most product developers choose to keep the original. All things being equal, it’s usually better to just keep the one that most of your users have been seeing already.
This is the end of this A/B test but you’re not done. Now you can move on to iterate on that first hypothesis. Say you do find that actively pushing users to join the loyalty program (variant B) does in fact increase participation. Now, you can test different versions of that message. Maybe you put it on the homepage instead of the checkout funnel. Maybe you test different benefit messages, different sized pop-up boxes, or different color submit buttons. Maybe the holdout group shows that getting people to join the loyalty program doesn’t actually increase purchases after all. Maybe another team member suggests that you should get customers’ birthdate information so that you can offer birthday promotions. Now you can A/B test that against a loyalty program.
Never stop iterating.
The simple answer: whenever you’re changing something. While savvy marketers are the most avid supporters of split testing, product managers and developers should also use it more often. We propose the following scenarios as critical examples of when to use A/B testing:
You may find the need to change a service, feature, or plugin on your website or app. Before doing so, test the change to see how it will affect the customer’s purchase process.
Paid advertising can be a waste of money if you don’t make sure you retain those users. Rigorously A/B testing your onboarding experience can help capture those users more consistently for sustainable user acquisition and growth.
One of the quickest ways to increase your revenue is by increasing your product price, but it comes with lots of risks. To ensure that you’re not scaring away customers, test out different price points on your sales page through A/B testing.
One of the best ways to boost your conversion rates is through A/B testing. Identify which element might drive more traffic or engagement and convert more visitors, then experiment with it to decide what to change and how to implement the change.
An untested website design may result in a dip in traffic and declining conversion rates. Therefore, it’s essential that a new layout is tested and the results analyzed before implementing final changes.
To achieve optimal results from an A/B test, it’s important to follow a framework that outlines clear goals and variations to test.
Whether testing an app or a website, the following are three best practices to keep in mind:
Do NOT make a change and then track everything in search of any statistically significant difference. If you are looking for 95% statistical significance and you track 1,000 KPIs, you should expect roughly 50 of them to show a statistically significant change purely by chance.
Always have a hypothesis first.
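To see why tracking everything is risky, here is a back-of-the-envelope sketch. It assumes the change has no real effect on any KPI and that the KPIs are independent of each other, which is a simplification. The function names are hypothetical, and the Bonferroni correction shown at the end is one simple, conservative way to compensate, not the only one.

```python
def expected_false_positives(num_kpis, alpha=0.05):
    """Average number of KPIs that look significant by chance alone."""
    return num_kpis * alpha


def chance_of_any_false_positive(num_kpis, alpha=0.05):
    """Probability that at least one KPI is a false positive (independent KPIs)."""
    return 1 - (1 - alpha) ** num_kpis


def bonferroni_alpha(num_kpis, alpha=0.05):
    """Stricter per-KPI threshold that keeps the overall error rate near alpha."""
    return alpha / num_kpis


for num_kpis in (1, 20, 1000):
    print(
        f"{num_kpis} KPIs: "
        f"~{expected_false_positives(num_kpis):.1f} false positives expected, "
        f"{chance_of_any_false_positive(num_kpis):.1%} chance of at least one, "
        f"Bonferroni threshold {bonferroni_alpha(num_kpis):.5f}"
    )
```

Even with only 20 KPIs, the chance of at least one spurious "win" is well over half, which is why a single primary KPI tied to your hypothesis is much easier to trust.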
For A/B testing, there needs to be a certain amount of traffic to produce valid results. Here is a handy calculator to see how many users you’d need to reach statistical significance.
In the same vein, make sure that each user is part of only one test at a time. If you want to run two A/B tests at a time but don’t have enough users to split the group into two, run one test and then the other. There may be unintended contamination if each user sees more than one change at a time.
One of the major errors people make with A/B testing is ending the experiment early. Not only does ending a test early waste your effort, but it also undermines the validity of the statistics you’ve collected.