- (Chamandy et al., 2015) ⇒ Nicholas Chamandy, Omkar Muralidharan, and Stefan Wager. (2015). “Teaching Statistics at Google-Scale.” In: The American Statistician, 69(4).
Subject Headings: Applied Statistics.
Modern data and applications pose very different challenges from those of the 1950s or even the 1980s. Students contemplating a career in statistics or data science need to have the tools to tackle problems involving massive, heavy-tailed data, often interacting with live, complex systems. However, despite the deepening connections between engineering and modern data science, we argue that training in classical statistical concepts plays a central role in preparing students to solve Google-scale problems. To this end, we present three industrial applications where significant modern data challenges were overcome by statistical thinking.
Technology companies like Google generate and consume data on a staggering scale. Massive, distributed data present novel and interesting challenges for the statistician, and have spurred much excitement among students, and even a new discipline: Data Science. Hal Varian famously quipped in 2009 that Statistician would be "the sexy job in the next 10 years" (Lohr, 2009), a claim seemingly backed up by the proliferation of job postings for data scientists in high-tech. The McKinsey Global Institute took a more urgent tone in their 2011 report examining the explosion of big data in industry (Manyika et al., 2011). While extolling the huge productivity gains that tapping such data would bring, they predicted a shortage of hundreds of thousands of "people with deep analytical skills", and millions of data-savvy managers, over the next few years.
Massive data present great opportunities for a statistician. Estimating tiny experimental effect sizes becomes routine, and practical significance is often more elusive than mere statistical significance. Moreover, the power of approaches that pool data across observations can be fully realized. But these exciting opportunities come with strings attached. The data sources and structures that a graduating statistician (or data scientist) faces are unlike those of the 1950s, or even the 1980s. Statistical models we take for granted are sometimes out of reach: constructing a matrix of dimensions n by p can be pure fantasy, and outliers are the rule, not the exception. Moreover, powerful computing tools have become a prerequisite to even reading the data. Modern data are in general unwieldy and raw, often contaminated by 'spammy' or machine-generated observations. More than ever, data checking and sanitization are the domain of the statistician.
In this context, some might think that the future of data science education lies in engineering departments, with a focus on building ever more sophisticated data-analysis systems. Indeed, the American Statistical Association Guidelines Workgroup noted the "increased importance of data science" as the leading Key Point of its 2014 Curriculum Guidelines (ASA, 2014). As Diane Lambert and others have commented, it is vital that today's statisticians have the ability to "think with data" (Hardin et al., 2014). We agree wholeheartedly with the notion that students must be fluent in modern computational paradigms and data manipulation techniques. We present a counter-balance to this narrative, however, in the form of three data analysis challenges inspired by an industrial 'big data' problem: click cost estimation. We illustrate how each example can be tackled not with fancy computation, but with new twists on standard statistical methods, yielding solutions that are not only principled, but also practical.
Our message is not at odds with the ASA's recent recommendations; indeed, the guidelines highlight statistical theory, "flexible problem solving skills", and "problems with a substantive context" as core components of the curriculum (ASA, 2014). The methodological tweaks presented in this article are not particularly advanced, touching on well-known results in the domains of resampling, shrinkage, randomization and causal inference. Their contributions are more conceptual than theoretical. As such, we believe that each example and solution would be accessible to an undergraduate student in a statistics or data science program. In relaying this message to such a student, it is not the specific examples or solutions presented here that should be stressed. Rather, we wish to emphasize the value of a solid understanding of classical statistical ideas as they apply to modern problems in preparing tomorrow's students for large-scale data science challenges.
In many cases, the modern statistician acts as an interface between the raw data and the consumer of those data. The term 'consumer' is used rather broadly here. Traditionally it would include the key decision-makers of a business, and perhaps the scientific research community at large. Increasingly, computerized 'production' systems (automated decision engines that are critical to a company's success) may also be counted among the consumers. Leaning on this framing, we see that difficulties may arise in all three phases of data science: the input phase, the inference phase, and the output phase. Section 2 highlights some challenges posed by the input or data retrieval phase. We describe a common paradigm for data reduction and some of the constraints it imposes, and then show how we can overcome the resulting engineering difficulties with a targeted use of the bootstrap. In Section 3 we present a concrete modeling problem made more difficult by the scale of the data, and discuss how to judiciously simplify the problem to alleviate the computational burden without needlessly reducing the quality of inference. Section 4 considers the output phase, specifically the situation where the output of a statistical model is piped directly into a production system. We frame this in terms of a causal inference paradigm, highlight a problem which may result, and propose a remedy. In addition to providing evidence for the value of statistical analysis in modern data science, we also hope that these examples can provide students with an idea of the kind of problems that modern industrial statisticians may be confronted with.
2 Sharding, MapReduce, and the Data Cube
Many new challenges in statistics arise from the size and structure of modern data sources. Here we focus on difficulties caused by a sharded architecture, where pieces of the data are stored on many different machines (or shards), often in a manner determined by system efficiency rather than ease of analysis. Such architectures usually arise when we need to work with data sources that are too large to be practically handled on a single machine. In practice, the sharding happens far upstream of any data analysis; the statistician typically has no control over which pieces of data are stored on which machines.
One common procedure for reducing massive data to a smaller form amenable to inference is the MapReduce framework (Dean and Ghemawat, 2004). A detailed description of MapReduce is beyond the scope of this paper; however, we give a brief overview in Section 2.1 below.
At a high level, MapReduce lets us create (multi-way) histograms of our data. This computational architecture makes computing some functionals of the data easy, while other functionals are more difficult to get. For example, if $x \in \mathbb{R}^{100}$, then computing $E[x]$ is easy, but computing the number of pairs of observations that are exact duplicates of each other is hard. As statisticians, we need to express the primitives on which we base our inferential procedures in terms of queries that are feasible with MapReduce.
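To illustrate why the mean is an easy functional in this setting, here is a minimal Python sketch. It assumes each shard can be reduced independently to a small summary (a coordinate-wise sum and a row count) and that only these summaries are combined centrally. The shard layout and the names `shard_summary` and `combine_mean` are hypothetical and are not part of any MapReduce API.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def shard_summary(rows: Sequence[Sequence[float]], dim: int) -> Tuple[Vector, int]:
    """Reduce one shard to (coordinate-wise sum, number of rows)."""
    total = [0.0] * dim
    for row in rows:
        for j, value in enumerate(row):
            total[j] += value
    return total, len(rows)


def combine_mean(summaries: Sequence[Tuple[Vector, int]], dim: int) -> Vector:
    """Combine per-shard summaries into the overall mean vector."""
    total = [0.0] * dim
    n = 0
    for shard_sum, count in summaries:
        for j in range(dim):
            total[j] += shard_sum[j]
        n += count
    if n == 0:
        raise ValueError("no observations in any shard")
    return [t / n for t in total]


# Toy data: three shards of 2-dimensional observations, one of them empty.
shards = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
    [],
]
summaries = [shard_summary(rows, dim=2) for rows in shards]
mean = combine_mean(summaries, dim=2)
print(mean)
```

The mean works because it decomposes into per-shard sums and counts. Counting exact duplicate pairs has no such small per-shard summary when duplicates can sit on different shards, which is one way to see why it is the harder query.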
2.1 Computing with MapReduce
Here, we follow the exposition of Chamandy et al. (2012), who consider a basic instance of MapReduce; a reproduction of a stylized diagram of the system is provided in Figure 1. The internals of MapReduce are less important to this discussion than the sort of data that the program takes as input and spits out at the other end as output. The input can in general be petabytes ($10^{15}$ B) of sharded, unstructured or hierarchical data objects (such as machine-generated log files). The output is typically a flattened table of key-value tuples of aggregated statistics.
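The input and output shapes described above can be sketched with a toy, single-process imitation of the map and reduce steps. This is only an illustration under stated assumptions: the record fields (`country`, `device`, `cost`) are invented for the example, and a real system would shuffle keys across machines rather than group them in one in-memory dictionary.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one (key, value) pair per click record."""
    key = (record["country"], record["device"])
    yield key, (1, record["cost"])


def reduce_values(values):
    """Reduce step: aggregate all values sharing a key into (clicks, cost)."""
    clicks = sum(n for n, _ in values)
    cost = sum(c for _, c in values)
    return clicks, cost


def run_mapreduce(shards):
    """Group mapped values by key, then reduce each group."""
    grouped = defaultdict(list)
    for shard in shards:
        for record in shard:
            for key, value in map_record(record):
                grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


# Hypothetical click log, split across two shards.
shards = [
    [
        {"country": "US", "device": "mobile", "cost": 0.5},
        {"country": "CA", "device": "desktop", "cost": 1.25},
    ],
    [
        {"country": "US", "device": "mobile", "cost": 0.25},
    ],
]
table = run_mapreduce(shards)
for key, (clicks, cost) in sorted(table.items()):
    print(key, clicks, cost)
```

The resulting `table` is a flat mapping from a key tuple to aggregated statistics, which is the form in which the statistician typically receives the data.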
3 The Long Tail
Large data sets are notorious for having long-tailed occurrence distributions: many units occur infrequently, but together, these infrequent units account for a large fraction of total events. For example, Google advertisers have collectively input millions of keywords into their advertising campaigns. These keywords allow the advertiser to indicate a desire to show an ad when the user's search query contains the specified terms. But most Google ad keywords are only seen a small number of times. Because of this, when doing statistical analysis of such long-tailed data, the number of units of interest scales with the amount of data collected. This creates computational challenges for traditional statistical methods.
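The long-tail property can be made concrete with a small Python sketch that measures, for a given occurrence threshold, what fraction of units is rare and what fraction of all events those rare units contribute. The helper name `tail_share` and the toy keyword data are assumptions made for illustration only.

```python
from collections import Counter


def tail_share(events, max_count):
    """Return (fraction of units, fraction of events) for units that
    occur at most `max_count` times in `events`."""
    counts = Counter(events)
    total_events = sum(counts.values())
    if total_events == 0:
        raise ValueError("no events")
    rare = [c for c in counts.values() if c <= max_count]
    return len(rare) / len(counts), sum(rare) / total_events


# Toy data: two frequent keywords and ten keywords seen exactly once.
events = (
    ["shoes"] * 6
    + ["flights"] * 4
    + [f"rare keyword {i}" for i in range(10)]
)
unit_frac, event_frac = tail_share(events, max_count=1)
print(unit_frac, event_frac)
```

In this toy data most units are seen only once, yet together they account for half of all events, which is the pattern the paragraph above describes.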
3.1 Example: Cost per click modeling
Suppose we are interested in predicting the average cost per ad click on each keyword. Such a prediction model might serve a number of purposes: classifying query strings into commerciality bins, estimating the revenue that will be generated by a new ad campaign, etc. Some of these uses may entail memory or computational constraints, for example if the model will be used as part of an online serving system.
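As a deliberately simple illustration of per-keyword prediction, the Python sketch below shrinks each keyword's observed average cost per click toward the overall average, with more shrinkage for keywords that have few clicks. This is a generic shrinkage estimator chosen for illustration and should not be read as the model developed in the paper. The function name `shrunken_cpc`, the tuning constant `prior_clicks`, and the data are all hypothetical.

```python
def shrunken_cpc(stats, prior_clicks=10.0):
    """Shrink per-keyword average cost per click toward the global average.

    stats maps keyword -> (clicks, total_cost). The estimate equals
    w * raw_average + (1 - w) * global_average, with
    w = clicks / (clicks + prior_clicks).
    """
    total_clicks = sum(clicks for clicks, _ in stats.values())
    total_cost = sum(cost for _, cost in stats.values())
    if total_clicks == 0:
        raise ValueError("no clicks observed")
    global_cpc = total_cost / total_clicks
    return {
        keyword: (cost + prior_clicks * global_cpc) / (clicks + prior_clicks)
        for keyword, (clicks, cost) in stats.items()
    }


# Hypothetical data: one head keyword and one tail keyword.
stats = {
    "car insurance": (990, 1980.0),
    "left-handed kazoo": (10, 5.0),
}
estimates = shrunken_cpc(stats)
for keyword, value in estimates.items():
    print(keyword, round(value, 4))
```

The head keyword's estimate barely moves from its raw average, while the tail keyword's estimate is pulled substantially toward the global average. The per-keyword state is a single number, which matters if the model must fit within the memory constraints of a serving system.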
4 Statistical Feedback
In traditional applications, both academic and industrial, statistical models or inference procedures tend to be of a 'throw-away' nature. A model is fit and a parameter estimated, or a decision is made under uncertainty, and the result is communicated to the relevant stakeholders. Typically, the statistician's job ends there, and the model or inference is unlikely to be revisited except if improvements to the methodology are desired, or in rare follow-up or meta-analyses.
In this section we describe a scenario where the output of a sequentially-run statistical procedure is at risk of being used by one or more of its consumers in a way that could be damaging to the inference itself. Such use is typically not malicious, nor can it really be termed abuse; nonetheless it may have unintended side-effects.
In this paper, we described three sub-problems within the broader context of click cost estimation for paid web search. By no means have we presented an exhaustive list of methodological challenges awaiting the modern industrial statistician. Rather, we have chosen the problems to highlight a few distinct ways in which modern data and applications differ from traditional statistical applications, yet can benefit from careful statistical analysis.
In the first problem, uncertainty estimation was nontrivial because the types of second-order statistics that statisticians normally take for granted can be prohibitively expensive in large data streams. The second problem, which centered around building a fine-grained prediction model, explored the gains that are possible by judiciously chopping off the long tail of a massive data set. Finally, Example 3 highlighted the importance of protecting one's methods from unexpected uses, in an environment where product decisions driven by statistical models can feed back inconspicuously into the inference procedure. All three problems initially appeared to be straightforward and amenable to standard methods, until a technical wrinkle presented itself. In each case, it was primarily through careful statistical reasoning, and not by designing some heavyweight engineering machinery, that we were able to arrive at a solution.
In doing so, we drew upon ideas that have been around for at least four decades, and in some cases much longer. Over that time, these methods have been taught to undergraduate students with an eye to small data problems. It is increasingly important, in light of the rapid growth of available data sources, to modernize classroom examples so that graduating students are not shell-shocked by their first encounter with big data. Likewise, data-based simulation techniques should be emphasized as a proving-ground for methodology before applying it naïvely to a billion data points. Early exposure to tools like MapReduce can help build confidence, but frankly, a newly-minted statistician with "deep analytical skills" should have no trouble learning such things on the job. As we have argued, it remains as critical as ever that we continue to equip students with classical techniques, and that we teach each and every one of them to think like a statistician.