Filtered Web Text Corpus
A Filtered Web Text Corpus is a web text corpus that has undergone quality filtering to remove low-quality content before it is used for language model training.
- AKA: Filtered Web Corpus, Cleaned Web Text, Quality-Filtered Web Data, Processed Web Corpus, Curated Web Text.
- Context:
- It can typically apply quality metrics through perplexity scores and coherence measures.
- It can typically remove spam content through spam detection and duplicate removal.
- It can typically exclude inappropriate content through safety filters and content moderation.
- It can often maintain language consistency through language detection and code removal.
- It can often preserve semantic diversity through topic analysis and domain balancing.
- It can often enable quality assurance through automated checks and sampling validation.
- It can range from being a Lightly-Filtered Web Corpus to being a Heavily-Filtered Web Corpus, depending on its filter stringency.
- It can range from being a Small Filtered Corpus to being a Large Filtered Corpus, depending on its corpus size.
- It can range from being a Single-Source Filtered Corpus to being a Multi-Source Filtered Corpus, depending on its source diversity.
- It can range from being a Domain-Specific Filtered Corpus to being a General Filtered Corpus, depending on its content scope.
- ...
- Examples:
- Reddit-Filtered Corpuses, such as:
- Quality-Processed Corpuses, such as:
- ...
- Counter-Examples:
- Raw Web Crawl, which contains unfiltered data with spam and duplicates.
- Manually Curated Text, which uses human selection rather than algorithmic filters.
- Social Media Firehose, which is an unfiltered stream without quality control.
- See: Web Text Corpus, Text Corpus, OpenAI WebText Dataset, Common Crawl Dataset, Colossal Clean Crawled Corpus, Data Filtering, Language Model Training Dataset, Web Scraping.
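
The duplicate removal, spam detection, and filter stringency items in the Context list can be illustrated with a minimal sketch. It assumes a corpus small enough to fit in memory and uses only simple heuristics; every name and threshold here (FilterConfig, passes_quality, filter_corpus, min_words, and so on) is hypothetical and not taken from any particular corpus pipeline. Perplexity scoring, language detection, and safety filtering are left out because they would need external models.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class FilterConfig:
    # Hypothetical thresholds; raising min_words or lowering the ratios
    # moves the result from "lightly filtered" toward "heavily filtered".
    min_words: int = 5
    max_symbol_ratio: float = 0.3
    max_repeated_word_ratio: float = 0.5


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def passes_quality(text: str, cfg: FilterConfig) -> bool:
    """Apply cheap heuristic checks; return False for likely low-quality text."""
    words = text.split()
    if not words or len(words) < cfg.min_words:
        return False
    # Share of characters that are neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if symbols / len(text) > cfg.max_symbol_ratio:
        return False
    # Share of the document taken up by its most frequent word (a spam proxy).
    lowered = [w.lower() for w in words]
    top_count = max(lowered.count(w) for w in set(lowered))
    if top_count / len(words) > cfg.max_repeated_word_ratio:
        return False
    return True


def filter_corpus(docs, cfg=None):
    """Drop exact duplicates (after normalization), then low-quality documents."""
    cfg = cfg or FilterConfig()
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate of an earlier document
        seen.add(key)
        if passes_quality(doc, cfg):
            kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog near the river.",
    "the quick  brown fox jumps over the lazy dog near the river.",
    "buy buy buy buy buy buy now",
    "$$$ !!! ### @@@ %%% ^^^ &&&",
    "Too short.",
]
light = filter_corpus(docs)
heavy = filter_corpus(docs, FilterConfig(min_words=20))
```

With the default configuration only the first document survives: the second is a duplicate once case and whitespace are normalized, and the rest fail the repetition, symbol, or length checks. The stricter configuration rejects the first document as well. Exact hashing only catches identical text after normalization; near-duplicate detection would need a different technique.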