Hadoop Streaming Utility

A Hadoop Streaming Utility is a Hadoop utility that allows MapReduce jobs to be created and run with any executable or script (reading input from stdin and writing output to stdout) acting as the mapper and/or the reducer.



References

2011

  • http://hadoop.apache.org/mapreduce/docs/current/streaming.html
    • Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. For example:

      $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc

      How Streaming Works

      In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a MapReduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

      When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized, as discussed later.

      When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.

      This is the basis for the communication protocol between the MapReduce framework and the streaming mapper/reducer.

      You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to:

      $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper org.apache.hadoop.mapred.lib.IdentityMapper -reducer /bin/wc

      Users can set stream.non.zero.exit.is.failure to true or false to make a streaming task that exits with a non-zero status count as a failure or a success, respectively. By default, streaming tasks exiting with a non-zero status are considered failed tasks.
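
      The stdin/stdout contract described in this excerpt is straightforward to satisfy from a scripting language. Below is a minimal sketch of a streaming mapper in Python; the word-count task and the script name wc_mapper.py are illustrative assumptions, not part of the Apache example. It reads raw lines from stdin and emits one tab-separated key/value pair per word, which the framework then parses using the first-tab rule described above.

      #!/usr/bin/env python
      # wc_mapper.py (hypothetical name): a minimal Hadoop Streaming mapper.
      # Reads raw input lines from stdin and emits "key<TAB>value" lines on
      # stdout; the prefix up to the first tab becomes the key.
      import sys

      for line in sys.stdin:
          for word in line.split():
              # Key is the word; value is the count 1.
              sys.stdout.write("%s\t%s\n" % (word, 1))

      Such a script could stand in for /bin/cat in the command above (e.g. -mapper wc_mapper.py), assuming it is executable and available on the task nodes.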

2010

  • http://developer.yahoo.com/hadoop/tutorial/module4.html#streaming
    • Whereas Pipes is an API that provides close coupling between C++ application code and Hadoop, Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

      Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job. Both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

      Input and output are always represented textually in Streaming. The input (key, value) pairs are written to stdin for a Mapper or Reducer, with a 'tab' character separating the key from the value. The Streaming programs should split the input on the first tab character on the line to recover the key and the value. Streaming programs write their output to stdout in the same format: key \t value \n.

      The inputs to the reducer are sorted so that while each line contains only a single (key, value) pair, all the values for the same key are adjacent to one another.

      Running a Streaming Job: To run a job with Hadoop Streaming, use the following command:

      $ bin/hadoop jar contrib/streaming/hadoop-version-streaming.jar

      The command as shown, with no arguments, will print some usage information. An example of how to run real commands is given below:

      $ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper myMapProgram -reducer myReduceProgram -input /some/dfs/path -output /some/other/dfs/path

      This assumes that myMapProgram and myReduceProgram are present on all nodes in the system ahead of time. If this is not the case, but they are present on the node launching the job, then they can be "shipped" to the other nodes with the -file option:

      $ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper myMapProgram -reducer myReduceProgram -file myMapProgram -file myReduceProgram -input some/dfs/path -output some/other/dfs/path

      Any other support files necessary to run your program can be shipped in this manner as well.
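
      Because reducer input arrives sorted with all values for a key adjacent, a streaming reducer can aggregate in a single pass over stdin. Below is a minimal Python sketch pairing with the hypothetical word-count mapper above; the script name wc_reducer.py is likewise an illustrative assumption.

      #!/usr/bin/env python
      # wc_reducer.py (hypothetical name): a minimal Hadoop Streaming reducer.
      # Input lines arrive sorted by key, so all values for a given key are
      # adjacent and can be grouped in one pass.
      import sys
      from itertools import groupby

      def parse(stream):
          for line in stream:
              # Split on the first tab only: prefix = key, remainder = value.
              key, _, value = line.rstrip("\n").partition("\t")
              yield key, value

      for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
          # Treat a missing value (no tab on the line) as 0.
          total = sum(int(value or 0) for _, value in group)
          sys.stdout.write("%s\t%d\n" % (key, total))

      Both scripts can be shipped to the task nodes with -file as described above, e.g. -mapper wc_mapper.py -reducer wc_reducer.py -file wc_mapper.py -file wc_reducer.py (filenames and paths are illustrative).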

2009