OpenAI WebText Dataset

From GM-RKB

An OpenAI WebText Dataset is a filtered web text corpus that contains text from web pages linked in Reddit posts, filtered through upvote (karma) thresholds by OpenAI, Inc.

AKA: WebText, OpenAI WebText, GPT-2 Training Data, Reddit-Filtered Corpus, OpenAI WebText Corpus.
Context:
- It can typically contain high-quality web documents extracted from Reddit outbound links with community validation.
- It can typically support GPT-2 model training with diverse text content and natural language patterns.
- It can typically exclude low-quality content through Reddit karma thresholds (at least 3 karma per linking post in the original corpus) and spam filtering.
- It can often provide 40 gigabytes of filtered text data for unsupervised language modeling.
- It can often include long-form content from news articles, blog posts, and reference material.
- It can often enable transfer learning through domain-diverse content and linguistic variety.
- It can range from being a Small WebText Sample to being a Full WebText Dataset, depending on its data volume.
- It can range from being a Lightly-Filtered WebText to being a Heavily-Filtered WebText, depending on its quality threshold.
- It can range from being an English-Only WebText to being a Multilingual WebText, depending on its language scope.
- It can range from being a Raw WebText to being a Tokenized WebText, depending on its preprocessing level.
- ...
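
The karma-threshold selection described in the context items above can be sketched roughly as follows. This is a minimal illustration, not OpenAI's actual pipeline: the `Submission` record, the `normalize` helper, and the `blocked_hosts` parameter are hypothetical names introduced here, and the URL-based de-duplication is a simplifying assumption. Only the threshold of 3 karma comes from the published description of WebText.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

MIN_KARMA = 3  # threshold reported for the original WebText corpus


@dataclass(frozen=True)
class Submission:
    """Hypothetical record: one Reddit post that links to an external page."""
    url: str
    karma: int


def normalize(url: str) -> str:
    """Crude de-duplication key (an assumption): lowercase host, no scheme,
    no fragment, no trailing slash on the path."""
    parts = urlsplit(url)
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def select_links(submissions, min_karma=MIN_KARMA, blocked_hosts=("reddit.com",)):
    """Return outbound URLs whose linking post reached min_karma, de-duplicated.

    blocked_hosts is an illustrative way to keep only *outbound* links;
    it is not taken from the source.
    """
    seen = set()
    kept = []
    for sub in submissions:
        if sub.karma < min_karma:
            continue  # below the community-validation threshold
        host = urlsplit(sub.url).netloc.lower()
        if any(host == b or host.endswith("." + b) for b in blocked_hosts):
            continue  # not an outbound link
        key = normalize(sub.url)
        if key in seen:
            continue  # same page already selected
        seen.add(key)
        kept.append(sub.url)
    return kept
```

A real pipeline would still need to fetch each selected page and extract its text; the sketch covers only the link-selection step.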
Examples:
- WebText-40GB, the original 40 gigabyte corpus used for GPT-2 training.
- WebText Training Splits, such as:
  - WebText Train Set with 36GB training data.
  - WebText Validation Set with 2GB validation data.
  - WebText Test Set with 2GB test data.
- WebText-Derived Datasets, such as:
  - OpenWebText as an open-source recreation.
  - WebText2 as an extended version.
- ...
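
The split sizes listed above (36 GB, 2 GB, 2 GB of 40 GB) amount to roughly a 90/5/5 partition. One common way to produce such a partition reproducibly is to hash each document identifier into a bucket, sketched below. This is an assumption for illustration only: the source does not say how the splits were made, `assign_split` is a hypothetical helper, and splitting by document count only approximates a split by bytes.

```python
import hashlib


def assign_split(doc_id: str, train: float = 0.90, valid: float = 0.05) -> str:
    """Deterministically map a document id to 'train', 'valid', or 'test'.

    The id is hashed to a number in [0, 1); the fractions define the
    bucket boundaries. The remaining mass (1 - train - valid) is 'test'.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + valid:
        return "valid"
    return "test"
```

Because the assignment depends only on the identifier, re-running the split over a re-crawled or re-ordered corpus places each document in the same set.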
Counter-Examples:
- Common Crawl, which uses indiscriminate web crawling rather than quality filtering.
- Wikipedia Corpus, which contains encyclopedia articles rather than diverse web content.
- BookCorpus, which uses published books rather than web pages.
See: Text Corpus, Filtered Web Text Corpus, OpenAI Platform Dataset Collection, GPT-2 Large Language Model, Reddit, Language Model Training Dataset, Web Scraping, Natural Language Processing Dataset, OpenAI, Inc.