OpenAI Common Crawl Dataset
An OpenAI Common Crawl Dataset is a Common Crawl dataset that undergoes OpenAI preprocessing for large-scale language model training by OpenAI, Inc.
- AKA: OpenAI Common Crawl, OpenAI Web Crawl, OpenAI CC Dataset, OpenAI Preprocessed Web Data, OpenAI Common Crawl Snapshot.
- Context:
- It can typically contain 200 GB of preprocessed web data from Common Crawl snapshots with quality filtering.
- It can typically include text extractions through HTML parsing and boilerplate removal.
- It can typically provide deduplicated content through MinHash algorithms and exact matching.
- It can often support foundation model training through web-scale text data and diverse domains.
- It can often enable transfer learning through cross-domain content and multilingual text.
- It can often facilitate representation learning through semantic diversity and contextual variety.
- It can range from being a Small Common Crawl Sample to being a Full Common Crawl Snapshot, depending on its data size.
- It can range from being a Lightly-Processed Common Crawl to being a Heavily-Processed Common Crawl, depending on its filtering depth.
- It can range from being a Single-Snapshot Common Crawl to being a Multi-Snapshot Common Crawl, depending on its temporal coverage.
- It can range from being a General Common Crawl to being a Domain-Filtered Common Crawl, depending on its content selection.
- ...
- Examples:
- OpenAI CC-200GB, the 200-gigabyte dataset under the CC0 license.
- OpenAI Common Crawl Subsets, such as:
- ...
- Counter-Examples:
- Raw Common Crawl, which lacks OpenAI processing and quality filters.
- C4 Corpus, which uses a different filtering pipeline developed by Google Research.
- OpenAI WebText Dataset, which selects pages through Reddit outbound links rather than broad web crawling.
- See: Common Crawl Dataset, Common Crawl Foundation, OpenAI Platform Dataset Collection, Web Crawl Dataset, Large-Scale Dataset, Colossal Clean Crawled Corpus, Web Data Commons, Language Model Training Dataset.
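
The deduplication step named in the Context items (exact matching plus MinHash algorithms) can be illustrated with a minimal sketch. It is not OpenAI's actual pipeline, which this entry does not describe. The function names, the 3-word shingle size, the 64 hash functions, and the 0.8 similarity threshold are all assumptions chosen for illustration.

```python
import hashlib
import re

NUM_HASHES = 64     # assumed signature length (hypothetical value)
SHINGLE_SIZE = 3    # assumed word-shingle width (hypothetical value)
THRESHOLD = 0.8     # assumed near-duplicate cutoff (hypothetical value)


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_key(text):
    """SHA-256 of the normalized text, used for exact-match deduplication."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=SHINGLE_SIZE):
    """Set of overlapping k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each salted hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=THRESHOLD):
    """Drop exact duplicates first, then documents whose MinHash similarity
    to an already-kept document reaches the threshold."""
    seen_keys = set()
    kept = []  # (document, signature) pairs
    for doc in docs:
        key = exact_key(doc)
        if key in seen_keys:
            continue
        seen_keys.add(key)
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= threshold for _, other in kept):
            continue
        kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

This sketch compares every new signature against all kept signatures, which is quadratic in the number of documents. At web scale, a MinHash pipeline would normally bucket signatures with locality-sensitive hashing instead; that step is omitted here for brevity.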