Google n-Grams Dataset

A Google n-Grams Dataset is a word n-gram frequency dataset produced by Google Inc.

Context:
- It can be composed of Google n-Grams Records, such as:
  - serve as the incoming; 92
  - serve as the incubator; 99
  - serve as the independent; 794
  - serve as the index; 223
  - …
See: Word N-gram Model.
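
The record layout shown above (an n-gram, a separator, and a count) can be read with a few lines of Python. This is a minimal sketch that assumes the `n-gram; count` form used on this page; the separator in the distributed files may differ, and `parse_record` and `share` are hypothetical helper names. Because the list above is truncated, the computed share is only relative to the four listed records, not a true conditional probability.

```python
from collections import Counter


def parse_record(line):
    """Split a record such as 'serve as the index; 223' into (tokens, count)."""
    # Split on the LAST ';' so that an n-gram containing ';' still parses.
    ngram, _, count = line.rpartition(";")
    return tuple(ngram.split()), int(count.replace(",", ""))


# The four records listed on this page (the list itself is truncated).
records = [
    "serve as the incoming; 92",
    "serve as the incubator; 99",
    "serve as the independent; 794",
    "serve as the index; 223",
]

counts = Counter()
for line in records:
    tokens, count = parse_record(line)
    counts[tokens] += count

# Sum of the listed counts only: 92 + 99 + 794 + 223 = 1208.
total = sum(counts.values())


def share(tokens):
    """Fraction of the listed counts that belongs to one n-gram."""
    return counts[tokens] / total
```

For example, `share(("serve", "as", "the", "independent"))` returns 794 / 1208. A real language model would instead divide by the count of the shared context ("serve as the"), which is not given here.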

References

2006

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
- Item Name: Web 1T 5-gram Version 1
- Authors: Thorsten Brants, Alex Franz
- LDC Catalog No.: LDC2006T13
- ISBN: 1-58563-397-6
- Release Date: Sep 19, 2006
- Introduction: This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. We expect this data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
- Source Data: The n-gram counts were generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.
- Character Encoding: The input encoding of documents was automatically detected, and all text was converted to UTF8.
- Tokenization: The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
  - Hyphenated words are usually separated, and hyphenated numbers usually form one token.
  - Sequences of numbers separated by slashes (e.g. in dates) form one token.
  - Sequences that look like URLs or email addresses form one token.
- Data Sizes
  - File sizes: approx. 24 GB compressed (gzip'ed) text files
  - Number of tokens: 1,024,908,267,229
  - Number of sentences: 95,119,665,584
  - Number of unigrams: 13,588,391
  - Number of bigrams: 314,843,401
  - Number of trigrams: 977,069,902
  - Number of fourgrams: 1,313,818,354
  - Number of fivegrams: 1,176,470,663
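
The tokenization exceptions listed in the entry above can be illustrated with a small regular-expression tokenizer. This is a rough sketch, not the tokenizer used to build the dataset: the patterns, the name `tokenize`, and the choice to emit the hyphen of a hyphenated word as its own token are all assumptions made for illustration.

```python
import re

# Alternatives are tried left to right, so the "one token" exceptions
# come before the generic word and punctuation patterns.
TOKEN_RE = re.compile(
    r"""
      https?://\S+                      # URL-like sequence: one token
    | www\.\S+                          # URL-like sequence without a scheme
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email-like sequence: one token
    | \d+(?:/\d+)+                      # numbers separated by slashes (e.g. dates)
    | \d+(?:-\d+)+                      # hyphenated numbers: one token
    | \w+                               # ordinary word
    | [^\w\s]                           # any other single symbol, incl. '-'
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text (whitespace is dropped)."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Mail bob@example.com on 9/19/2006 about well-known 1-800")
# tokens == ['Mail', 'bob@example.com', 'on', '9/19/2006',
#            'about', 'well', '-', 'known', '1-800']
```

The sketch ignores many Penn Treebank conventions (for example, contractions and quotes), and a URL followed directly by punctuation keeps that punctuation inside the token.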

2006a

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
- All Our N-gram are Belong to You
- Thursday, August 03, 2006, 11:26:00 AM
- Posted by Alex Franz and Thorsten Brants, Google Machine Translation Team
