2013 OnlineControlledExperimentsatLa

(Kohavi et al., 2013) ⇒ Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. (2013). “Online Controlled Experiments at Large Scale.” In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ISBN:978-1-4503-2174-7 doi:10.1145/2487575.2488217

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

A/B testing; controlled experiments; experimental design; randomized experiments

Abstract

Web-facing companies, including Amazon, eBay, Etsy, Facebook, Google, Groupon, Intuit, LinkedIn, Microsoft, Netflix, Shop Direct, StumbleUpon, Yahoo, and Zynga use online controlled experiments to guide product development and accelerate innovation. At Microsoft's Bing, the use of controlled experiments has grown exponentially over time, with over 200 concurrent experiments now running on any given day. Running experiments at large scale requires addressing multiple challenges in three areas: cultural / organizational, engineering, and trustworthiness. On the cultural and organizational front, the larger organization needs to learn the reasons for running controlled experiments and the tradeoffs between controlled experiments and other methods of evaluating ideas. We discuss why negative experiments, which degrade the user experience short term, should be run, given the learning value and long-term benefits. On the engineering side, we architected a highly scalable system, able to handle data at massive scale: hundreds of concurrent experiments, each containing millions of users. Classical testing and debugging techniques no longer apply when there are billions of live variants of the site, so alerts are used to identify issues rather than relying on heavy up-front testing. On the trustworthiness front, we have a high occurrence of false positives that we address, and we alert experimenters to statistical interactions between experiments. The Bing Experimentation System is credited with having accelerated innovation and increased annual revenues by hundreds of millions of dollars, by allowing us to find and focus on key ideas evaluated through thousands of controlled experiments. A 1% improvement to revenue equals more than $10M annually in the US, yet many ideas impact key metrics by 1% and are not well estimated a-priori. 
The system has also identified many negative features that we avoided deploying, despite key stakeholders' early excitement, saving us similar large amounts.

Introduction

Many web-facing companies use online controlled experiments to guide product development and prioritize ideas, including Amazon [1], eBay, Etsy [2], Facebook, Google [3], Groupon, Intuit [4], LinkedIn, Microsoft [5], Netflix [6], Shop Direct [7], StumbleUpon [8], Yahoo, and Zynga [9]. Controlled experiments are especially useful in combination with agile development, Steve Blank’s Customer Development process [10], and MVPs (Minimum Viable Products) popularized by Eric Ries’s Lean Startup [11]. In a “Lean Startup” approach, “businesses rely on validated learning, scientific experimentation, and iterative product releases to shorten product development cycles, measure progress, and gain valuable customer feedback” [12].

Large scale can have multiple dimensions, including the number of users and the number of experiments. We are dealing with Big Data and must scale on both dimensions: each experiment typically exposes several million users to a treatment, and over 200 experiments are running concurrently. While running online controlled experiments requires a sufficient number of users, teams working on products with thousands to tens of thousands of users (our general guidance is at least thousands of active users) are typically looking for larger effects, which are easier to detect than the small effects that large sites worry about. For example, to increase the experiment sensitivity (detectable effect size) by a factor of 10, say from 5% delta to 0.5%, you need 10^2 = 100 times more users. Controlled experiments thus naturally scale from small startups to the largest of web sites. Our focus in this paper is on scaling the number of experiments: how can organizations evaluate more hypotheses, increasing the velocity of validated learnings [11], per time unit.
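The quadratic relationship between sensitivity and sample size can be checked with a short sketch. It assumes the common rule of thumb n ≈ 16σ²/Δ² per variant (roughly 80% power at a 5% significance level); that formula, the function names, and the value of `sigma` are illustrative assumptions and are not taken from the quoted text.

```python
def users_per_variant(sigma: float, delta: float) -> float:
    """Rule-of-thumb users needed per variant to detect an absolute change `delta`.

    Assumption: n = 16 * sigma^2 / delta^2 (about 80% power, 5% significance).
    """
    return 16.0 * sigma ** 2 / delta ** 2


def scale_factor(old_delta: float, new_delta: float) -> float:
    """How many times more users are needed when the detectable delta shrinks."""
    return (old_delta / new_delta) ** 2


if __name__ == "__main__":
    sigma = 1.0  # hypothetical standard deviation; it cancels out in the ratio
    coarse = users_per_variant(sigma, 0.05)   # detect a 5% delta
    fine = users_per_variant(sigma, 0.005)    # detect a 0.5% delta
    print(fine / coarse)                      # about 100
    print(scale_factor(0.05, 0.005))          # about 100
```

Because the required sample size depends on 1/Δ², the ratio is independent of the metric's variance, which is why the factor of 100 holds for any metric under this approximation.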

We share our experiences, how we addressed challenges, and key lessons from having run thousands of online controlled experiments at Bing, part of Microsoft’s Online Services Division. Microsoft’s different divisions use different development methodologies. Office and Windows follow Sinofsky’s long planning and execution cycles [13]. Bing has thousands of developers, program managers, and testers, using online controlled experiments heavily to prioritize ideas and decide which changes to ship to all users. Bing’s Experimentation System is one of the largest in the world, and pushes the envelope on multiple axes, including culture, engineering, and trustworthiness. In the US alone, it distributes traffic from about 100 million monthly users executing over 3.2B queries a month [14] to over 200 experiments running concurrently. Almost every user is in some experiment: 90% of users eligible for experimentation (e.g., browser supports cookies) are each rotated into over 15 concurrent experiments, while 10% are put into a holdout group to assess the overall impact of the Experimentation System and to help with alerting.
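One way to picture the 90% / 10% split is deterministic hash-based bucketing. The sketch below is a minimal illustration under stated assumptions, not the Bing system's actual mechanism: the use of SHA-256, the salts, the bucket count, and the names `bucket`, `in_holdout`, and `assign` are all hypothetical.

```python
import hashlib

HOLDOUT_FRACTION = 0.10  # from the text: 10% of eligible users are held out
NUM_BUCKETS = 1000       # assumption: illustrative bucket granularity


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket; a different salt gives an independent split."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def in_holdout(user_id: str) -> bool:
    """True for roughly HOLDOUT_FRACTION of users, who see no experiments."""
    return bucket(user_id, "holdout") < HOLDOUT_FRACTION * NUM_BUCKETS


def assign(user_id: str, areas: dict) -> dict:
    """Pick one variant per experimentation area for a non-holdout user.

    `areas` maps a (hypothetical) area name to its list of variant names.
    """
    if in_holdout(user_id):
        return {}
    return {
        area: variants[bucket(user_id, area) % len(variants)]
        for area, variants in areas.items()
    }


if __name__ == "__main__":
    areas = {"ads": ["control", "treatment"], "ui": ["control", "a", "b"]}
    print(assign("user-42", areas))
```

Salting the hash per area keeps a user's assignment in one area statistically independent of their assignment in another, which is what lets a single user participate in many concurrent experiments.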

Analysis of an experiment utilizing 20% of eligible users (10% control, 10% treatment) over 2 weeks processes about 4TB of data to generate a summary scorecard. With about 5 experiments in each one of 15 concurrent experimentation areas (conservative numbers), users end up in one of 5^15 ≈ 30 billion possible variants of Bing. Automated analyses, or scorecards, are generated on clusters consisting of tens of thousands of machines [1] to help guide product releases, to shorten product development cycles, measure progress, and gain valuable customer feedback. Alerts fire automatically when experiments hurt the user experience, or interact with other experiments. While the overall system has significant costs associated with it, its value far outweighs those costs: ideas that were implemented by small teams, and were not even prioritized high by the team implementing them, have had surprisingly large effects on key metrics. For example, two small changes, which took days to develop, each increased ad revenue by about $100 million annually [16].
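The variant count above follows from simple combinatorics, sketched here under the simplifying assumption that each user lands in exactly one of the 5 experiments in every one of the 15 areas. The function name `count_variants` is illustrative.

```python
def count_variants(experiments_per_area: int, areas: int) -> int:
    """Number of distinct site variants when one experiment is chosen per area."""
    return experiments_per_area ** areas


if __name__ == "__main__":
    total = count_variants(5, 15)
    print(f"{total:,}")  # 30,517,578,125, i.e. about 30 billion
```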

…

References

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2013 OnlineControlledExperimentsatLa	Ron Kohavi Brian Frasca Toby Walker Alex Deng Ya Xu Nils Pohlmann			Online Controlled Experiments at Large Scale				10.1145/2487575.2488217		2013