OpenAI WebText Dataset
An OpenAI WebText Dataset is a filtered web text corpus, created by OpenAI, Inc., that contains text from web pages linked on Reddit and selected through upvote thresholds.
- AKA: WebText, OpenAI WebText, GPT-2 Training Data, Reddit-Filtered Corpus, OpenAI WebText Corpus.
- Context:
- It can typically contain high-quality web documents extracted from Reddit outbound links with community validation.
- It can typically support GPT-2 model training with diverse text content and natural language patterns.
- It can typically exclude low-quality content through Reddit karma thresholds (the GPT-2 work kept outbound links with at least 3 karma) and spam filtering.
- It can often provide about 40 gigabytes of filtered text data (roughly 8 million documents in the original corpus) for unsupervised language modeling.
- It can often include long-form content from news articles, blog posts, and reference material.
- It can often enable transfer learning through domain-diverse content and linguistic variety.
- It can range from being a Small WebText Sample to being a Full WebText Dataset, depending on its data volume.
- It can range from being a Lightly-Filtered WebText to being a Heavily-Filtered WebText, depending on its quality threshold.
- It can range from being an English-Only WebText to being a Multilingual WebText, depending on its language scope.
- It can range from being a Raw WebText to being a Tokenized WebText, depending on its preprocessing level.
- ...
- Examples:
- WebText-40GB, the original 40 gigabyte corpus used for GPT-2 training.
- WebText Training Splits, such as:
- WebText-Derived Datasets, such as:
- OpenWebText as an open-source recreation.
- WebText2 as an extended version.
- ...
- Counter-Examples:
- Common Crawl, which uses indiscriminate web crawling rather than quality filtering.
- Wikipedia Corpus, which contains encyclopedia articles rather than diverse web content.
- BookCorpus, which uses published books rather than web pages.
- See: Text Corpus, Filtered Web Text Corpus, OpenAI Platform Dataset Collection, GPT-2 Large Language Model, Reddit, Language Model Training Dataset, Web Scraping, Natural Language Processing Dataset, OpenAI, Inc.
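
The karma-threshold filtering described in the Context items can be illustrated with a minimal Python sketch. It is not OpenAI's actual pipeline, which was never released in full. The `Submission` record, the `select_outbound_links` function, and the `blocked_domains` parameter are hypothetical names introduced only for illustration, and the default threshold of 3 follows the figure reported for GPT-2. The sketch keeps outbound links that meet the threshold, skips links that point back to Reddit, and removes duplicate URLs.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for one Reddit link post (not an official schema)."""
    url: str
    karma: int


def select_outbound_links(submissions, min_karma=3, blocked_domains=("reddit.com",)):
    """Return unique outbound URLs whose post karma meets the threshold.

    Assumptions for illustration: karma is a plain integer per post, and
    links back to Reddit itself are not treated as outbound documents.
    """
    seen = set()
    kept = []
    for post in submissions:
        # Community validation: drop links below the karma threshold.
        if post.karma < min_karma:
            continue
        host = (urlparse(post.url).hostname or "").lower()
        # Drop malformed URLs that have no host component.
        if not host:
            continue
        # Drop links to a blocked domain or any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in blocked_domains):
            continue
        # Exact-URL de-duplication, keeping the first occurrence.
        if post.url in seen:
            continue
        seen.add(post.url)
        kept.append(post.url)
    return kept


if __name__ == "__main__":
    posts = [
        Submission("https://example.com/article", 5),
        Submission("https://example.com/article", 7),   # duplicate URL
        Submission("https://example.org/low", 1),       # below threshold
        Submission("https://www.reddit.com/r/test", 10),  # not outbound
    ]
    print(select_outbound_links(posts))
```

A real pipeline would also need to fetch each page, extract its main text, and de-duplicate at the document level; those steps are omitted here.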