How Khan Academy Uses A/B Testing to Improve Student Learning



Today’s blog post is an interview with Alan Pierce, a software engineer at Khan Academy on the infrastructure team. Alan, a software and data expert, spent the last 6 months completely re-thinking and re-writing Khan Academy’s internal A/B testing tool to increase speed and functionality so that the rest of the team can easily create A/B tests to optimize student learning. In this post, Alan and I talk about what he learned about A/B testing and A/B testing infrastructure. His words sounded all too familiar to our engineers’ ears.

Apptimize: What are some common A/B testing mistakes you’ve seen?

Alan: The biggest mistake I’ve seen is people jumping to conclusions too readily. It’s easy to come up with an intricate story for why the numbers look the way they look, when actually there isn’t any statistically significant difference, or the difference is due to a bug rather than an interesting user behavior. There are also lots of statistical mistakes that are easy to make. For example, it’s tempting to search through the big list of conversions until you find one that has a statistically significant difference. But if you do that, you’re bound to find differences that just appear statistically significant due to chance.

To avoid this mistake, you should think carefully about your tracked conversions and experiment length before starting an experiment, rather than taking a “try it and see what happens” approach. A great way to build intuition is to run an A/A test, where you don’t actually change anything and think about what conclusions you would have made if you thought it was a real experiment.
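To make the multiple-comparisons trap concrete, here is a minimal Python sketch of a simulated A/A test. It is an illustration, not Khan Academy’s tooling: the function names, the 10% conversion rate, the group size, and the choice of a pooled two-proportion z-test are all assumptions, and each conversion is simulated independently, which real conversions usually are not.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    # Two-sided normal tail: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_test(num_conversions=40, users_per_group=2000, rate=0.10, seed=0):
    """Count conversions that look 'significant' although both groups are identical."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(num_conversions):
        # Both groups draw from the same rate: any difference is pure chance.
        conv_a = sum(rng.random() < rate for _ in range(users_per_group))
        conv_b = sum(rng.random() < rate for _ in range(users_per_group))
        p = two_proportion_p_value(conv_a, users_per_group, conv_b, users_per_group)
        if p < 0.05:
            false_positives += 1
    return false_positives


if __name__ == "__main__":
    print(simulate_aa_test(), "of 40 conversions crossed p < 0.05 with no real change")
```

With a 0.05 threshold, roughly one conversion in twenty is expected to cross the line by chance alone, which is why picking the conversions to watch before the experiment starts matters.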

Another mistake I’ve seen is not thinking through all of the possible unintended side-effects when you give a different experience to different users. For example, Khan Academy is often used in classrooms, where you might have a teacher explaining a task to a class of students. If a button is missing or in the wrong place for some students, but not others, it can lead to a lot of confusion. Students already list their teacher on Khan Academy for the coaching tools to work correctly, so we can use that information to make sure that everyone in a classroom always ends up in the same A/B test alternative.
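One way to keep a whole classroom in the same alternative is to bucket on the teacher rather than on the individual student. The Python sketch below is a guess at the general shape, not the actual Khan Academy implementation; `assign_alternative`, its parameters, and the hashing scheme are hypothetical.

```python
import hashlib


def assign_alternative(experiment, user_id, alternatives, teacher_id=None):
    """Deterministically pick an alternative; classmates share a bucket."""
    # Hypothetical rule: bucket by teacher when one is listed, else by user.
    bucketing_key = f"teacher:{teacher_id}" if teacher_id else f"user:{user_id}"
    digest = hashlib.sha256(f"{experiment}:{bucketing_key}".encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(alternatives)
    return alternatives[index]


if __name__ == "__main__":
    alts = ["control", "new_button"]
    print(assign_alternative("button_move", "student_1", alts, teacher_id="teacher_9"))
    print(assign_alternative("button_move", "student_2", alts, teacher_id="teacher_9"))
```

Including the experiment name in the hash keeps a classroom’s bucket in one experiment independent of its bucket in another. The trade-off is that the unit of randomization becomes the classroom rather than the student, so the analysis has fewer independent units to work with.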

Our translations also provided another case of unintended side-effects. So far, Khan Academy has been doing translations on a best-effort basis, which means we display the English string when a translation isn’t available. This means that if a developer wants to reword some text and run the change through an A/B test, it’ll look fine on the English site, but on the Spanish site, you’ll be comparing a group seeing the original Spanish text and a group seeing the new untranslated English text. Not only does this hurt the user experience, it also skews the results for the entire experiment. To help avoid this mistake, I made our A/B tests only affect the English site by default, but made the system flexible enough that we can run tests for all users or for any particular language.
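The “English by default, but configurable” behavior might look something like the following Python sketch. The `Targeting` class, the `text_for_user` helper, and the language codes are hypothetical names used for illustration; the interview does not describe the real API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Targeting:
    # None means "all languages"; the default restricts a test to English.
    languages: Optional[FrozenSet[str]] = frozenset({"en"})

    def includes(self, user_language: str) -> bool:
        return self.languages is None or user_language in self.languages


def text_for_user(targeting, user_language, alternative, original, reworded):
    """Users outside the targeted languages always see the original text."""
    if targeting.includes(user_language) and alternative == "reworded":
        return reworded
    return original


if __name__ == "__main__":
    default = Targeting()
    # A Spanish-site user keeps the translated original even in the test bucket.
    print(text_for_user(default, "es", "reworded", "Empezar", "Get started"))
    print(text_for_user(default, "en", "reworded", "Start", "Get started"))
```

Users in an untargeted language should also be left out of the experiment’s results, not just shown the original text, so that they do not dilute the comparison.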

Apptimize: What is the most interesting A/B test you’ve seen so far?

Alan: In one experiment, we tried to create a sense of community by showing a message like “17 other users are working on this exercise now” when students were doing exercises. The hope was that these messages would improve user engagement. What was interesting was that we didn’t need to fully implement the feature to try it out. Instead, we did some log analysis to get some realistic numbers for each exercise and displayed those numbers, rather than implementing a real-time system. The experiment ended up not improving engagement numbers very much, so we scrapped the idea instead of wasting a bunch of developer time implementing the real feature.

In another case, the ability to do custom analysis and dig into the data after the fact ended up saving the day. When our problem recommendation system thinks that an exercise would be way too easy for you, it will sometimes give you a “Challenge Card” question, and getting that question right fast-forwards your progress on that exercise to the highest level instead of requiring you to prove yourself over and over. More challenge cards means more opportunity to quickly get up to your learning edge, so in a recent experiment, we tried lowering the threshold so that more challenge cards would be issued. Unfortunately, after running the test for over a month, the results didn’t look very promising: our custom “learning gain” metric was lower on average, so we were about to call the experiment a failure. But then, the data science team lead, Jace Kohlmeier, dug into the data some more and found a better way of interpreting it. Basically, giving out more challenge cards meant that more barely-active users were eligible for the learning gain statistic, which made the average learning gain a little lower. By applying a complex reweighting and recomputing the learning gain, Jace was able to remove this effect. The result was that the learning gain was actually higher with more challenge cards. Several other people on the data science team verified this analysis, and the experiment was declared a success.
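The interview does not say what the reweighting actually was, so the Python sketch below swaps in a much simpler technique, reweighting each group to the control group’s user mix, purely to show how a change in who is eligible can drag an average down. The strata, the user counts, and the gain values are made up for illustration and are not Khan Academy data.

```python
def mean_gain(strata):
    """Plain average over all eligible users; strata maps name -> (users, avg_gain)."""
    total_users = sum(users for users, _ in strata.values())
    return sum(users * gain for users, gain in strata.values()) / total_users


def reweighted_mean_gain(strata, reference):
    """Average that uses the reference group's user mix instead of the group's own."""
    total_users = sum(users for users, _ in reference.values())
    weighted = sum(reference[name][0] * strata[name][1] for name in reference)
    return weighted / total_users


# Made-up numbers: the test group has a higher gain in every stratum,
# but many more barely-active users became eligible for the metric.
control = {"active": (100, 1.0), "barely_active": (10, 0.2)}
more_cards = {"active": (100, 1.1), "barely_active": (50, 0.3)}

if __name__ == "__main__":
    print("raw control:        ", round(mean_gain(control), 3))
    print("raw more cards:     ", round(mean_gain(more_cards), 3))
    print("reweighted more cards:", round(reweighted_mean_gain(more_cards, control), 3))
```

In this toy data the raw average is lower for the group with more challenge cards even though each stratum improved, and the comparison flips once both groups are evaluated on the same mix of users.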

Apptimize: When did Khan Academy realize the need to start A/B testing and how has the organization changed since you began testing?

Alan: Khan Academy has pretty much always known that the way to get the best possible learner experience is to do A/B testing. The first full-time developer (and now the dev lead), Ben Kamens, was already fully aware of the advantages of A/B testing when he joined in 2010. For the first few months, there were higher priorities, but soon afterward, he wrote the first version of the A/B testing framework and started running A/B tests. This original version of the framework was designed to make A/B testing as easy as possible so that it would catch on, and it certainly accomplished that.

Since the early days, we’ve learned a lot about how to run A/B tests and the right way to interpret their results. We gained much more confidence in our A/B testing by running A/A tests and by taking a closer look at strange experiment results to figure out what was really going on, and we were able to make a number of fixes and incremental improvements to the A/B testing framework as a result. When I joined Khan Academy, A/B testing was a well-established norm, but the A/B testing framework had lots of quirks and performance issues because it was written so early in the organization’s lifetime, which is what prompted the rewrite. The opportunity to improve education through real empirical data was one of the things that drew me to Khan Academy in the first place, so I’m excited to see what improvements we’ll come up with in the future.

Apptimize: What are the most important features an A/B testing solution must have?

Alan: In addition to the basic ability to run A/B tests and view the results, a good A/B testing solution should make it easy to explore all of the A/B tests that are going on in your organization and how they are progressing. In Khan Academy’s old A/B testing framework, there was a dashboard that had a big list of results for all active experiments, but the only explanation for an experiment was its name, which was usually a generic name like “Video page redesign v3”. With our new system, every experiment needs to have a title, an owner, a description, and (when it’s finished) a conclusion. I was originally worried that people wouldn’t take the time to fill out descriptions, but in practice, they’re often several paragraphs and explain the motivation behind the experiment, the hypothesis, and any links or images that help in explaining the change.
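As a rough illustration of the required metadata, the Python sketch below models an experiment record that cannot be created without a title, owner, and description, and cannot be finished without a conclusion. The `ExperimentRecord` class and its validation rules are hypothetical; the interview only lists the fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentRecord:
    title: str
    owner: str
    description: str
    conclusion: Optional[str] = None
    finished: bool = False

    def __post_init__(self):
        # Reject blank metadata up front so the dashboard never shows a bare name.
        for field_name in ("title", "owner", "description"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} is required")

    def finish(self, conclusion: str) -> None:
        """Mark the experiment finished; a conclusion is mandatory at this point."""
        if not conclusion.strip():
            raise ValueError("a finished experiment needs a conclusion")
        self.conclusion = conclusion
        self.finished = True


if __name__ == "__main__":
    record = ExperimentRecord(
        title="Video page redesign v3",
        owner="example-owner",
        description="Hypothesis, motivation, and links go here.",
    )
    record.finish("Placeholder conclusion text.")
    print(record)
```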

Probably the most unexpectedly-popular feature was the notification system: every time an experiment starts or stops, the system sends an email and IM notification to the whole team with the experiment’s description and a link to the latest results. Since most interesting changes to the site are done through A/B tests, it’s an easy way for anyone in the organization to see what ideas the developers are trying out, and which ones are successful. The notifications give A/B testing much more of a presence, especially to new people, and I think they’ve made Khan Academy’s A/B testing culture even stronger.

Apptimize: What kinds of challenges have you faced while developing Khan Academy’s A/B testing framework?

Alan: The fact that it was a rewrite of an existing system ended up acting as sort of a double-edged sword. The old A/B testing system was popular and had a long history, so it was nice to be able to run ideas by people and get good feedback, and it was nice to be able to look at the experiments in the old system to get a sense of the use cases and the scale that I would need to handle.

The downside to the rewrite was that I needed to transition the experiments from the old framework instead of just working from scratch. I wanted to be confident that the new system was equivalent to the old one, so I built the new system to be compatible with the old one. I ran every Khan Academy experiment side-by-side in both systems for a month or two while I focused on making the system more stable and worked on the new dashboard frontend. When I was confident that the new system was working correctly, I switched all of our experiments to exclusively use the new system, and eventually completely shut off the old system. The whole time, I had to keep the dev team up-to-date on how to work with the dual system and what to expect from the new API.
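A side-by-side run like this comes down to asking both systems the same question and recording any disagreement. The Python sketch below shows that idea under simplifying assumptions: `compare_systems` is a hypothetical helper, and each system is reduced to a function that maps a user ID to an alternative.

```python
def compare_systems(user_ids, old_assign, new_assign):
    """Return (user_id, old_alternative, new_alternative) for every disagreement."""
    mismatches = []
    for user_id in user_ids:
        old_alt = old_assign(user_id)
        new_alt = new_assign(user_id)
        if old_alt != new_alt:
            mismatches.append((user_id, old_alt, new_alt))
    return mismatches


if __name__ == "__main__":
    # Stand-in assignment functions; the second one has a deliberate bug for user 3.
    def old_system(user_id):
        return "A" if user_id % 2 == 0 else "B"

    def new_system(user_id):
        return "A" if user_id % 2 == 0 or user_id == 3 else "B"

    print(compare_systems(range(6), old_system, new_system))
```

An empty mismatch list over a long enough period is the kind of evidence that would justify switching experiments over to the new system alone.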

Another challenge was that I wasn’t expecting such a wide range of use cases. Even though the framework is only used internally within Khan Academy, I still had to appeal to a bunch of different audiences since we run many different types of experiments. Many experiments make a small tweak to a user interface in the hopes of improving easily-measurable metrics, such as the number of math problems attempted, so I needed that case to be easy. In other experiments, though, the developer running the experiment is trying to improve a difficult-to-measure metric like overall learning and wants to run custom analysis, so I needed to make it possible to dig into the raw data and cross-reference the A/B test data with any other interesting data. Other experiments try to optimize specific user workflows and want to do things like funnel analysis.

One experiment I saw invited a few hundred specific users to beta test a new feature, and the change to the user experience was so dramatic that just about every metric was worth looking at. In other cases, we make changes that we don’t expect to improve any short-term measurable metrics, and the A/B testing system is just used to roll the feature out to new users and make sure it doesn’t hurt any metrics. In one or two other cases, the A/B testing system is just used for tracking analytics without actually running an experiment.

If I only had to solve one or two of these problems, I probably could have made some additional simplifying assumptions about the scale parameters, the metrics to collect, or the UI to show for an experiment, but I ended up needing to have a reasonable solution to all of these use cases.

About Apptimize

Apptimize is an innovation engine that provides A/B testing and feature release management for native mobile, web, mobile web, hybrid mobile, OTT, and server. Industry leaders like HotelTonight, The Wall Street Journal, and Glassdoor have created amazing user experiences with Apptimize.

Thanks for