Salesforce Wikipedia WikiText Dataset

From GM-RKB

A Salesforce Wikipedia WikiText Dataset is a large Wikipedia-based text corpus drawn from Wikipedia's verified Good and Featured articles.



References

2017

  • https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/
    • QUOTE: The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

      Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
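The size comparison in the quote above can be checked arithmetically. A minimal Python sketch, assuming the commonly cited training-set token counts for the three corpora (approximate figures; treat them as assumptions rather than authoritative values):

```python
# Approximate training-set token counts (assumed, from the WikiText release
# literature; not taken from this page).
PTB_TOKENS = 929_590           # preprocessed Penn Treebank training set
WIKITEXT2_TOKENS = 2_088_628   # WikiText-2 training set
WIKITEXT103_TOKENS = 103_227_021  # WikiText-103 training set

def size_ratio(corpus_tokens: int, baseline_tokens: int) -> float:
    """Return how many times larger a corpus is than the baseline."""
    return corpus_tokens / baseline_tokens

# Consistent with the quoted claims: over 2x and over 110x PTB, respectively.
print(f"WikiText-2 vs PTB:   {size_ratio(WIKITEXT2_TOKENS, PTB_TOKENS):.1f}x")
print(f"WikiText-103 vs PTB: {size_ratio(WIKITEXT103_TOKENS, PTB_TOKENS):.1f}x")
```

Under these counts the ratios come out to roughly 2.2x and 111x, matching the "over 2 times" and "over 110 times" claims in the quote.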

2016