2010 MapReduce: A Flexible Data Processing Tool

From GM-RKB

Subject Headings: MapReduce, Data Processing System.

Notes

Cited By

Quotes

Abstract

MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

Introduction

MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all intermediate values associated with the same intermediate key. We built a system around this programming model in 2003 to simplify construction of the inverted index for handling searches at Google.com. Since then, more than 10,000 distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, machine learning, and statistical machine translation. The Hadoop open source implementation of MapReduce has been used extensively outside of Google by a number of organizations [10, 11].

To help illustrate the MapReduce programming model, consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code like the following pseudocode:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just "1" in this simple example). The reduce function sums together all counts emitted for a particular word.
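The word-count pseudocode above can be sketched as a small single-machine program. This is only an illustration of the programming model, not the distributed implementation described in the paper; the function and variable names here are illustrative.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield (word, "1")

def reduce_fn(key, values):
    # key: a word; values: a list of counts (as strings)
    return str(sum(int(v) for v in values))

def run_mapreduce(documents):
    # Shuffle phase: group all intermediate values by intermediate key,
    # mirroring what the MapReduce runtime does between map and reduce.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)
    # Reduce phase: merge the values associated with each key.
    return {word: reduce_fn(word, counts)
            for word, counts in intermediate.items()}

docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
print(run_mapreduce(docs))  # "the" is counted twice across the two documents
```

In the real system the map and reduce invocations run in parallel on many machines, and the grouping step is a distributed shuffle rather than an in-memory dictionary.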

MapReduce automatically parallelizes and executes the program on a large cluster of commodity machines. The runtime system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing required inter-machine communication.

MapReduce allows programmers with no experience with parallel and distributed systems to easily utilize the resources of a large distributed system. A typical MapReduce computation processes many terabytes of data on hundreds or thousands of machines. Programmers find the system easy to use, and more than 100,000 MapReduce jobs are executed on Google’s clusters every day.

Performance

... The input may require a full scan over a set of files, as Pavlo et al. suggested, but alternate implementations are often used. For example, the input may be a database with an index that provides efficient filtering or an indexed file structure (such as daily log files used for efficient date-based filtering of log data).
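The point that input need not be a full file scan can be illustrated with a toy reader over date-partitioned log files. The class and method names below are hypothetical, chosen only to sketch the idea of filtering by file layout before any record is read.

```python
from datetime import date

class DailyLogReader:
    """Hypothetical input reader: yields only records from log files
    whose date falls in [start, end], skipping all other files
    instead of scanning the whole collection."""

    def __init__(self, files_by_date, start, end):
        self.files_by_date = files_by_date  # {date: [records]} per daily log file
        self.start, self.end = start, end

    def records(self):
        for d, recs in self.files_by_date.items():
            if self.start <= d <= self.end:  # date-based filtering via file layout
                yield from recs

logs = {
    date(2010, 1, 1): ["req-a"],
    date(2010, 1, 2): ["req-b"],
    date(2010, 1, 3): ["req-c"],
}
reader = DailyLogReader(logs, date(2010, 1, 2), date(2010, 1, 3))
print(list(reader.records()))  # only records from the two selected days
```

An index inside a database plays the same role: the reader hands the map function only the records that survive the filter.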

References

* (Dean & Ghemawat, 2010) ⇒ Jeffrey Dean, and Sanjay Ghemawat. (2010). "MapReduce: A Flexible Data Processing Tool." In: Communications of the ACM, 53(1). doi:10.1145/1629175.1629198