2007 Practical Guide to Controlled Experiments on the Web: Listen to Your Customers Not to the HiPPO


Subject Headings: Controlled Experiments, A/B Testing, E-commerce

Notes

Cited By

Quotes

Author Keywords

Controlled Experiments, A/B Testing, E-commerce

Abstract

The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person's Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.

3. Controlled Experiments

3.1 Terminology

The terminology for controlled experiments varies widely in the literature. Below we define key terms used in this paper and note alternative terms that are commonly used.

Overall Evaluation Criterion (OEC) (Roy, 2001). A quantitative measure of the experiment's objective. In statistics this is often called the Response or Dependent Variable (15; 16); other synonyms include Outcome, Evaluation metric, Performance metric, or Fitness Function (22). Experiments may have multiple objectives and a scorecard approach might be taken (29), although selecting a single metric, possibly as a weighted combination of such objectives, is highly desired and recommended (Roy, 2001 p. 50). A single metric forces tradeoffs to be made once for multiple experiments and aligns the organization behind a clear objective. A good OEC should not be short-term focused (e.g., clicks); to the contrary, it should include factors that predict long-term goals, such as predicted lifetime value and repeat visits. Ulwick describes some ways to measure what customers want (although not specifically for the web) (30).
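
For illustration, a single OEC can be computed as a weighted sum of per-variant metrics. The sketch below is ours, not the paper's; the metric names and weights are hypothetical placeholders.

<pre>
# A minimal sketch (not from the paper) of a single OEC built as a weighted
# combination of several objectives. Metric names and weights are hypothetical.

def oec(metrics: dict, weights: dict) -> float:
    """Combine multiple metrics into one scalar Overall Evaluation Criterion."""
    return sum(weights[name] * metrics[name] for name in weights)

variant_metrics = {"revenue_per_user": 2.40,   # short-term objective
                   "repeat_visit_rate": 0.35}  # proxy for long-term value
weights = {"revenue_per_user": 0.7, "repeat_visit_rate": 0.3}

print(oec(variant_metrics, weights))  # 1.785
</pre>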

Factor. A controllable experimental variable that is thought to influence the OEC. Factors are assigned Values, sometimes called Levels or Versions. Factors are sometimes called Variables. In simple A/B tests, there is a single factor with two values: A and B.

Variant. A user experience being tested by assigning levels to the factors; it is either the Control or one of the Treatments. Sometimes referred to as Treatment, although we prefer to specifically differentiate between the Control, which is a special variant that designates the existing version being compared against, and the new Treatments being tried. In case of a bug, for example, the experiment is aborted and all users should see the Control variant.

Experimentation Unit. The entity on which observations are made. Sometimes called an item. The units are assumed to be independent. On the web, the user is the most common experimentation unit, although some experiments may be done on sessions or page views. For the rest of the paper, we will assume that the experimentation unit is a user. It is important that the user receive a consistent experience throughout the experiment, and this is commonly achieved through cookies.
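
One common way to keep a user's assignment consistent is to hash a stable identifier (such as a cookie-stored user id) together with the experiment name, so the same user always lands in the same variant. The sketch below is an illustrative assumption, not the paper's implementation; the choice of MD5 here is arbitrary.

<pre>
# A minimal sketch of consistent variant assignment via hashing (assumed,
# not the paper's implementation). Hashing a stable user id together with
# the experiment name gives a deterministic, per-experiment assignment.
import hashlib

def assign_variant(user_id: str, experiment: str, n_variants: int = 2) -> int:
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_variants

# The same user always gets the same variant for a given experiment.
print(assign_variant("user-12345", "checkout-button-color"))
</pre>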

Null Hypothesis. The hypothesis, often referred to as [math]\displaystyle{ H_0 }[/math], that the OECs for the variants are not different and that any observed differences during the experiment are due to random fluctuations.

Confidence Level. The probability of failing to reject (i.e., retaining) the null hypothesis when it is true.

Power. The probability of correctly rejecting the null hypothesis, [math]\displaystyle{ H_0 }[/math], when it is false. Power measures our ability to detect a difference when it indeed exists.
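
To make the power concept concrete, a standard normal-approximation sample-size calculation can be sketched as follows; this is a textbook formula, not one quoted from this section of the paper.

<pre>
# A minimal power-calculation sketch (textbook normal approximation, assumed):
# per-variant sample size to detect a difference delta in means,
#   n ~= 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
from scipy.stats import norm

def sample_size_per_variant(sigma, delta, alpha=0.05, power=0.8):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test at 95% confidence
    z_beta = norm.ppf(power)            # 80% power
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

# e.g., detect a 0.01 absolute change in a metric with std-dev 0.2
print(round(sample_size_per_variant(sigma=0.2, delta=0.01)))  # ~6279 per variant
</pre>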

A/A Test. Sometimes called a Null Test (19). Instead of an A/B test, you exercise the experimentation system, assigning users to one of two groups, but expose them to exactly the same experience. An A/A test can be used to (i) collect data and assess its variability for power calculations, and (ii) test the experimentation system (the Null hypothesis should be rejected about 5% of the time when a 95% confidence level is used).
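
The 5% figure can be checked empirically. Below is a small simulation sketch, assuming user metrics are roughly normal and a two-sample t-test is used; these specifics are our assumptions, not the paper's.

<pre>
# A quick simulation sketch of the A/A idea: with no real difference between
# groups, a test at the 95% confidence level should reject the null
# hypothesis about 5% of the time.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
rejections = 0
runs = 1000
for _ in range(runs):
    a = rng.normal(loc=1.0, scale=0.5, size=2000)  # "control" users
    b = rng.normal(loc=1.0, scale=0.5, size=2000)  # identical "treatment"
    _, p_value = ttest_ind(a, b)
    rejections += p_value < 0.05

print(rejections / runs)  # expected to be close to 0.05
</pre>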

Standard Deviation (Std-Dev). A measure of variability, typically denoted by [math]\displaystyle{ \sigma }[/math].

Standard Error (Std-Err). For a statistic, it is the standard deviation of the sampling distribution of the sample statistic (15). For a mean of [math]\displaystyle{ n }[/math] independent observations, it is [math]\displaystyle{ \hat{\sigma} / \sqrt{n} }[/math], where [math]\displaystyle{ \hat{\sigma} }[/math] is the estimated standard deviation.
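
A short numerical illustration of the [math]\displaystyle{ \hat{\sigma} / \sqrt{n} }[/math] formula, using made-up data:

<pre>
# The standard error of a mean shrinks with sqrt(n): a short illustration
# on synthetic data (our example, not the paper's).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=10_000)
std_err = x.std(ddof=1) / np.sqrt(len(x))  # sigma-hat / sqrt(n)
print(std_err)  # ~0.02 for sigma ~ 2 and n = 10,000
</pre>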

5. Lessons Learned

5.2 Trust and Execution

5.2.4 Assign 50% of Users to Treatment

... Assuming all factors are fixed, a good approximation for the multiplicative increase in running time for an A/B test relative to 50%/50% is [math]\displaystyle{ 1/(4p(1-p)) }[/math] where the treatment receives portion [math]\displaystyle{ p }[/math] of the traffic. For example, if an experiment is run at 99%/1%, then it will have to run about 25 times longer than if it ran at 50%/50%.
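
The stated multiplier is easy to tabulate; the short sketch below evaluates it for a few treatment fractions.

<pre>
# The running-time multiplier 1/(4*p*(1-p)) from the text, evaluated for
# a few treatment fractions p (relative to a 50%/50% split).
def runtime_multiplier(p: float) -> float:
    return 1 / (4 * p * (1 - p))

for p in (0.5, 0.2, 0.1, 0.01):
    print(f"p = {p:4}: {runtime_multiplier(p):6.2f}x longer")
# p =  0.5:   1.00x longer
# p =  0.2:   1.56x longer
# p =  0.1:   2.78x longer
# p = 0.01:  25.25x longer
</pre>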

References

Ron Kohavi, Dan Sommerfield, and Randal M. Henne. (2007). "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers Not to the HiPPO." In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007). doi:10.1145/1281192.1281295