A/B Testing Glossary
A/B testing helps product managers rapidly grow and optimize digital experiences. Our A/B testing glossary breaks down terms and concepts commonly used when A/B testing across different channels – mobile, web, server and OTT.
A/A testing is a form of A/B testing in which both the control and the variation are identical. Rather than testing, say, a blue button against a red button, you test a blue button against an identical blue button.
If the versions are identical, what is the point of testing them? There are several reasons why running one (or more) A/A tests is a crucial first step before getting into A/B testing.
In practice, the process involves using an A/B testing tool to drive traffic toward two identical versions of the same page. Split testing identical versions of an app screen or webpage can help product teams identify and eliminate the variances that sometimes cause inaccurate results in A/B tests. It’s also useful for determining the baseline conversion rate before starting an A/B test or multivariate test.
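As a rough illustration of how such a 50/50 split can be implemented, here is a minimal sketch of deterministic bucketing by user ID. The hashing scheme, function name, and experiment name are assumptions for illustration, not how any particular A/B testing tool works.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "experiment_aa") -> str:
    """Deterministically bucket a user into variant 'A1' or 'A2' (a 50/50 split).

    In an A/A test both buckets render the identical experience; only the
    bucket label differs so the two groups can be compared afterwards.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # map the hash to 0-99
    return "A1" if bucket < 50 else "A2"    # 50% of users in each bucket

# The same user always lands in the same bucket on every call
print(assign_variant("user-42"))
print(assign_variant("user-42"))
```

Because the assignment is a pure function of the user ID, each user sees a consistent experience across sessions, which keeps the two groups cleanly separated for later analysis.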
There are three main benefits of running an A/A test: it gives your team a low-risk dress rehearsal with a new tool and new data, it verifies that the tool is reporting accurate data, and it establishes a baseline conversion rate (and sample size) for future A/B tests.
It’s easy to imagine how an A/B test will go. In practice, however, most teams find that reality is more complicated and nuanced than the plan, and Apptimize customers consistently come away from an A/A test with learnings they didn’t expect.
In addition to these learnings, teams generally find that running through a dress rehearsal helps them get familiar with the new tools, the new way of reading the data, and the new processes in a very low-risk way.
Beyond a whole new way of doing things, your team also needs to become comfortable with a new way of viewing the data. Perhaps your team is already adept at viewing event data in an analytics tool (like Amplitude or Adobe), but now everyone needs to get comfortable with layering statistical significance between two versions of the app or website on top of that event, engagement, and retention data. And if you have other tools that track user acquisition, deep linking, push notifications, and/or email, A/A testing will also help you understand when to connect data from different sources to get a clearer picture of what users are doing.
Your new A/B testing tool should, of course, give you accurate data. This sounds like a no-brainer, but in reality, discrepancies between data sources are almost ubiquitous. Marketers know that you’ll sometimes find differences even between Google Analytics and Google Ads data, and this happens for many reasons. An A/A test is critical in helping you uncover any data issues you need to watch out for when A/B testing.
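One way to make that kind of check concrete during an A/A test is to pull the same metric from each source and flag any gap above a tolerance you choose. The sources, counts, and 2% tolerance below are purely illustrative assumptions, not a recommendation from any tool vendor.

```python
def discrepancy_pct(count_a: float, count_b: float) -> float:
    """Relative difference between two sources, as a percentage of the first."""
    return abs(count_a - count_b) / count_a * 100

# Illustrative daily session counts from two hypothetical data sources
ab_tool_sessions = 10_480       # as reported by the A/B testing tool
analytics_sessions = 10_112     # as reported by the analytics tool

gap = discrepancy_pct(ab_tool_sessions, analytics_sessions)
if gap > 2.0:   # the tolerance is a judgment call; 2% is just an example
    print(f"Investigate: sources differ by {gap:.1f}%")
else:
    print(f"Within tolerance ({gap:.1f}%)")
```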
In theory, your A/A test should yield no significant results. It’s common for the data to show big swings in the first few days of an experiment, when only a few users have interacted with the A/A test, but over time those early false positives should disappear. This doesn’t mean you’ll see exactly the same conversion rate for the two identical variants: the variants might have slightly different observed conversion rates, but there should be no statistically significant difference between them.
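To see why those early swings settle down, here is a minimal simulation of an A/A test, not tied to any particular tool: both variants share the same true conversion rate, and a two-proportion z-test compares a small early sample against a much larger one. The 5% conversion rate, sample sizes, and random seed are arbitrary assumptions for illustration.

```python
import math
import random

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value of a two-proportion z-test."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions at all; nothing to distinguish
    z = (successes_a / n_a - successes_b / n_b) / se
    # convert |z| to a two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(7)
TRUE_RATE = 0.05  # identical for both variants, as in any A/A test

for n in (200, 20_000):  # a small early sample vs. a much larger one
    conv_a = sum(random.random() < TRUE_RATE for _ in range(n))
    conv_b = sum(random.random() < TRUE_RATE for _ in range(n))
    print(f"n={n:>6} per variant: A={conv_a / n:.2%}  B={conv_b / n:.2%}  "
          f"p-value={two_proportion_p_value(conv_a, n, conv_b, n):.3f}")
```

With only a couple of hundred users per variant, the observed rates can differ noticeably even though no real difference exists; as the sample grows, the two rates converge and the p-value stays comfortably above any reasonable significance threshold.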
Your minimum sample size should be large enough to yield real results, sized for the conversion goal your site ultimately cares about. It’s important to remember that A/A testing, like any other application of the scientific method, needs to be repeated several times to properly understand and evaluate the results. This helps establish a baseline conversion rate for future A/B testing and prevents the wasted time that comes with inaccurate initial results.
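As a rough sketch of what “large enough” means, the standard two-proportion sample-size approximation below estimates users needed per variant from a baseline conversion rate and the smallest lift you care to detect. The 5% baseline, 1-point minimum detectable effect, alpha, and power values are assumptions chosen for illustration, not recommendations.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * pooled_var / min_detectable_effect ** 2
    return ceil(n)

# Example: ~5% baseline conversion, detect a 1-point absolute lift
print(sample_size_per_variant(0.05, 0.01))  # on the order of 8,000 users per variant
```

The takeaway is that small effects on low baseline rates require surprisingly large samples, which is exactly why an undersized A/A or A/B test produces noisy, misleading early reads.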
As with any type of testing, you can’t expect to see the full range of results immediately. To increase accuracy, A/A tests should run long enough to collect a broad spectrum of data. Analyzing only one slice of the time period invites inaccurate or inconsistent results.
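One simple way to translate the required sample into a run length is to divide it by expected daily traffic and round up to whole weeks, so that every day of the week (and its traffic pattern) is represented. The sample and traffic figures below are assumptions for illustration only.

```python
import math

def test_duration_days(required_per_variant: int, variants: int,
                       daily_eligible_users: int) -> int:
    """Days needed to reach the required sample, rounded up to full weeks."""
    total_needed = required_per_variant * variants
    raw_days = math.ceil(total_needed / daily_eligible_users)
    return math.ceil(raw_days / 7) * 7  # cover complete weekly cycles

# Example: ~8,200 users per variant, 2 variants, 1,500 eligible users per day
print(test_duration_days(8_200, 2, 1_500))  # 14 days (two full weeks)
```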
You’ll want to set your statistical significance threshold to a number that fits your needs; you can set this number at the project level, and we recommend somewhere between 90% and 95%. As a benchmark, a 90% threshold represents a 1 in 10 chance of declaring a false positive when in fact the variations have no impact.
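To make the threshold concrete, the small sketch below shows how a confidence threshold maps to the alpha used when comparing a p-value, and why a 90% threshold implies roughly 1 in 10 A/A tests flagging a “winner” by chance. The function name and the example p-value are illustrative, not any tool’s API.

```python
def is_significant(p_value: float, confidence_threshold: float = 0.90) -> bool:
    """Declare a result significant only if p-value < alpha = 1 - threshold."""
    alpha = 1 - confidence_threshold
    return p_value < alpha

# A 90% threshold means alpha = 0.10: about 1 in 10 A/A tests will cross it
# by chance even though the two variants are identical.
print(is_significant(0.08, confidence_threshold=0.90))  # True  (would flag a winner)
print(is_significant(0.08, confidence_threshold=0.95))  # False (would not)
```

Raising the threshold to 95% cuts the false positive rate in half (to 1 in 20), at the cost of needing more data to reach significance when a real difference does exist.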