Etsy’s Culture of Continuous Experimentation and A/B Testing Spurs Mobile Innovation

Apptimize

I’ve been a personal fan of Etsy for a long time (I found half the art in my house while lying in bed with the Etsy app on my iPad). I’ve always been impressed with how they’ve managed to develop a great experience for me on all screens. How do they do it?

Turns out, it has a lot to do with their culture of continuous experimentation, which Etsy’s iOS Software Engineer, Lacy Rhoades, discussed in this talk at Breaking Development Nashville last October. We had the pleasure of meeting Lacy to get a deep dive on how Etsy does A/B testing, rolls out new features, and creates an experience that their customers really love.


Follow Lacy @lacyrhoades

Apptimize: What’s the best or most exciting experiment you’ve seen?

Lacy: The biggest successes we’ve had recently have been with experiments meant to test improvements for mobile users. In one experiment, we changed the default interface for tablet users to be more in line with our desktop experience. This turned out to be a big improvement. Another win has been improving the mobile checkout flow: we saw that asking some users for less information during checkout made them more likely to purchase items on mobile devices.

A: What are some A/B testing best practices that you use at Etsy?

L: Designing experiments correctly is very important. The infrastructure and the tooling are fun to work with, but you’ve got to know the right question to ask before you can find good answers. We make sure to check the size of the audience as a first step. If the exposure of an experiment is too small, it will take too long to prove any result with statistical significance. This can be the biggest challenge. (Note from Apptimize: check out our recent blog post on estimating audience size and time required for running A/B tests.)
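(Note from Apptimize: to make the audience-size check concrete, here’s a minimal back-of-the-envelope sketch of the standard two-proportion sample-size formula. The baseline rate and lift are made-up numbers, and this is not Etsy’s tooling.)

```python
import math
from statistics import NormalDist

def required_sample_size(baseline_rate, min_detectable_lift,
                         alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-proportion test.

    baseline_rate: control conversion rate, e.g. 0.04 for 4%
    min_detectable_lift: relative lift to detect, e.g. 0.05 for +5%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pooled * (1 - pooled))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# A 4% baseline with a +5% relative lift needs roughly 154k users
# per variant -- which is why a small audience can stall a test.
print(required_sample_size(0.04, 0.05))
```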

Be selective. If the group we select as eligible for an experiment is a biased sample, or if that audience becomes biased over time, it can ruin any results we get from our tests. If highly engaged users are more likely to become enrolled in an A/B experiment, that experiment is going to show stronger, and usually misleading, results. Even small groups of users can affect the outcome of an experiment. (Note from Apptimize: see the assignment sketch after this list for one common way to keep enrollment unbiased.)

Be patient. An experiment isn’t something you can tack on at the end of designing a product feature. It has to be baked in from the outset. Once the experiment is in place, chances are it’s going to take some time before you can statistically prove anything, so be ready to plan parallel work on other products or features.

Be discreet. Publicizing an experiment can quickly bias users and spoil its outcome. We try to be transparent about the work we’re doing to make our product better in the end, but at the same time, alerting the audience to the presence of an experiment means we’re inherently changing the outcome of the test.

Be reasonable. Don’t expect a breakaway success. If your experiments go well, oftentimes they will result in incremental changes to your product. This is what you want. Making small changes reduces risk and ensures that you’re headed in the right direction with your development work.
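(Note from Apptimize: here’s a minimal sketch of the deterministic, hash-based bucketing many experiment frameworks use to keep enrollment unbiased; the function and experiment names are hypothetical, not Etsy’s actual framework. Because assignment depends only on a stable hash of the user ID, a user’s engagement level can’t influence whether or when they’re enrolled.)

```python
import hashlib

def assign_variant(user_id: str, experiment_name: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket a user into a variant.

    Hashing the user ID together with the experiment name assigns
    every eligible user up front, independent of how engaged they
    are, and the same user always sees the same variant.
    """
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Hypothetical experiment name; the result is stable across calls.
print(assign_variant("user-12345", "tablet_default_view"))
```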

A: How did a culture of continuous experimentation start at the company?

L: Experimentation at Etsy comes from a desire to make informed decisions and to ensure that when we launch features for our millions of members, they work. Too often, we built features that took a lot of time and had to be maintained without any proof of their success or popularity among users. A/B testing allows us to tinker with small pieces and measure whether those pieces are moving in the right direction. We can say a feature is worth working on as soon as it’s underway, or even before, having measured the impact of small changes on our buyer and seller experiences.

A: How do you coordinate experiments in a large team?

L: Our teams are divided such that each usually covers only one product or user experience. This separation into vertically integrated features makes for easier experiments. Still, users can potentially come across several different features in one visit, so part of the experiment design at the outset is meeting with other teams and considering the overlap or influence their experiments will have on yours. Most importantly, you want to consider other experiments that might influence or confuse your results.

When experiments are meant for similar audiences, teams have to agree ahead of time to funnel users into groups, before anyone is assigned to an experiment or a control. For example, we might randomly funnel 50% of users into a group (with no result or effect on their experience). This “invisible” group becomes a label, and that audience will be reserved for one team to run their experiment. Any other team would be required to leave those users out of new tests. This cuts down the audience size for an experiment, which is a drawback, but it can make the experiment design more straightforward.
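(Note from Apptimize: here’s a minimal sketch of how such an “invisible” reservation might work; the team name and salt are hypothetical, not Etsy’s system. Hashing with a funnel-specific salt keeps the reservation independent of any later variant assignment.)

```python
import hashlib

def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def funnel_group(user_id: str) -> str:
    """Reserve half the audience before any experiment assignment happens.

    The label is 'invisible': it changes nothing about the user's
    experience; it only marks whom the checkout team may enroll.
    """
    return "checkout_team" if bucket(user_id, "2015_q2_funnel") < 50 else "unreserved"

def can_enroll(user_id: str, owning_team: str) -> bool:
    """Teams must leave the other group's users out of new tests."""
    reserved = funnel_group(user_id) == "checkout_team"
    return reserved == (owning_team == "checkout_team")
```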

A: How do product managers, designers, etc. fit along with developers into the experimentation method?

L: At Etsy, we’re really lucky to have product managers and designers who incorporate the idea of experiments into their plans. Oftentimes, product ideas and designs will rely heavily, from day one, on being able to test and run experiments. Designers love to be able to quantify that good design really does make things better. Product managers often want to have graphs to prove the success or impact of changes over time. Without experiments, we would be stuck overhauling an entire product, and then doing our best to look back and make a side-by-side comparison of metrics where one can’t really be made.

A: How do you decide what to test next? How do you prioritize experiments in a continuous cycle?

L: Testing at Etsy usually starts with an idea to improve an existing product, like a new feature. The first step is to break the new idea apart into smaller, testable components. We implement or research a component to verify that we’re onto something and that we’re making a net-positive or net-neutral impact on things like conversion or engagement.

A: How often do experiments have a significant result?

L: Most of the time we learn at least a handful of things from an experiment. We measure experiments in terms of their effect on key metrics like conversion, order value, bounce rate, or user engagement. It’s very rare that a change won’t affect at least one of these significantly. A lot of the time, experiments reveal that our ideas need more thought, and this is how experiments can save us a lot of time in development.
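(Note from Apptimize: as a concrete illustration of what “significantly” means here, below is a minimal sketch of a two-sided z-test on conversion rates. The visitor and conversion counts are invented, and this is not Etsy’s analysis pipeline.)

```python
import math
from statistics import NormalDist

def conversion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a/n_a: conversions and visitors in the control
    conv_b/n_b: conversions and visitors in the treatment
    Returns the p-value; below 0.05 is conventionally 'significant'.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Made-up numbers: 4.0% vs 4.3% conversion over 50k visitors each
# gives a p-value of about 0.017, a statistically significant lift.
print(conversion_z_test(2000, 50_000, 2150, 50_000))
```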

About Apptimize

Apptimize is an innovation engine that provides A/B testing and feature release management for native mobile, web, mobile web, hybrid mobile, OTT, and server. Industry leaders like HotelTonight, The Wall Street Journal, and Glassdoor have created amazing user experiences with Apptimize.

Thanks for reading!