Very Large Text Corpora
A Very Large Text Corpus is a large text corpus that is a very large dataset (based on a corpus size measure).
- AKA: Massive Text Collection, Big Text Data.
- Context:
- It can typically contain billions of tokens or terabytes of text data.
- It can often require distributed processing and specialized infrastructure.
- It can range from being a Web-Scale Corpus to being a Domain-Specific Large Corpus, depending on its data source.
- ...
- Example(s):
- Web Corpora, such as:
- Multilingual Corpora, such as:
- OpenSubtitles with parallel text in 60+ languages.
- Wikipedia Dumps across multiple languages.
- Historical Archives, such as:
- ...
- Counter-Example(s):
- Small Corpus, which contains fewer than 1 million words.
- Medium-Sized Corpus, which typically contains 1-100 million words.
- Specialized Corpus, which prioritizes domain coverage over size.
- See: Text Corpus, Big Data, Corpus Linguistics, Web Crawling.
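
The size bands mentioned above can be illustrated with a minimal Python sketch. It streams over lines so the text never has to fit in memory, counts whitespace-separated tokens, and maps the count to a size class. The function names `count_tokens` and `size_class` are hypothetical. The 1-billion-token cutoff is an assumption read from "billions of tokens", and the intermediate "large" band is also an assumption, since this entry does not define the range between 100 million and 1 billion words.

```python
from typing import Iterable

SMALL_MAX = 1_000_000             # entry: Small Corpus has fewer than 1 million words
MEDIUM_MAX = 100_000_000          # entry: Medium-Sized Corpus is 1-100 million words
VERY_LARGE_MIN = 1_000_000_000    # assumption: "billions of tokens" read as >= 1e9


def count_tokens(lines: Iterable[str]) -> int:
    """Count whitespace-separated tokens one line at a time (streaming)."""
    return sum(len(line.split()) for line in lines)


def size_class(n_tokens: int) -> str:
    """Map a token count to a size class using the thresholds above."""
    if n_tokens < SMALL_MAX:
        return "small"
    if n_tokens <= MEDIUM_MAX:
        return "medium"
    if n_tokens < VERY_LARGE_MIN:
        return "large"  # assumption: this band is not defined in the entry
    return "very large"
```

Because `count_tokens` accepts any iterable of strings, an open file handle can be passed directly. Whitespace splitting is only a rough proxy for a corpus size measure, and a real tokenizer would give different counts.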