Pig Program


A Pig Program is a software program composed of Pig statements (written in the Pig Latin programming language) that can be executed by a Pig software system.



References

2013

 input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
 -- Extract words from each line and put them into a Pig bag
 -- datatype, then flatten the bag to get one word per row
 words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
 -- keep only tokens made up entirely of word characters (drops whitespace and punctuation-only tokens)
 filtered_words = FILTER words BY word MATCHES '\\w+';
 -- create a group for each word
 word_groups = GROUP filtered_words BY word;
 -- count the entries in each group
 word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
 -- order the records by count
 ordered_word_count = ORDER word_count BY count DESC;
 -- write the ordered counts to the output location
 STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

The above program generates parallel-executable tasks that can be distributed across thousands of machines in a Hadoop cluster to count the occurrences of each word in a dataset such as "all the webpages on the internet".
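
To make the dataflow of the Pig program above easier to follow, here is a minimal single-machine sketch of the same steps in Python. It is an illustration only, not how Pig executes the program: the function name `word_count` is hypothetical, `str.split()` only approximates Pig's `TOKENIZE` (which also splits on a few punctuation characters), and nothing here is distributed across a cluster.

```python
import re
from collections import Counter


def word_count(lines):
    """Single-machine analogue of the Pig word-count dataflow (illustrative only)."""
    # LOAD + FOREACH ... FLATTEN(TOKENIZE(line)): produce one token per row.
    # Assumption: whitespace splitting stands in for Pig's TOKENIZE.
    words = (word for line in lines for word in line.split())

    # FILTER ... BY word MATCHES '\\w+': the whole token must consist of
    # word characters (ASCII letters, digits, underscore).
    filtered = (w for w in words if re.fullmatch(r"\w+", w, flags=re.ASCII))

    # GROUP ... BY word, then COUNT the entries in each group.
    counts = Counter(filtered)

    # ORDER ... BY count DESC: return (count, word) pairs, largest count first.
    pairs = ((n, w) for w, n in counts.items())
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)
```

Calling `word_count(["the cat and the hat", "the cat !!"])` returns a list of `(count, word)` pairs starting with `(3, "the")`; the token `!!` is dropped by the filter step. In the Pig version, each of these steps is compiled into tasks that Hadoop can run in parallel over partitions of the input.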