Text Corpus
A text corpus is a corpus composed of text items.
- AKA: Unstructured Document Collection.
- Context:
- It can range from being a Very Large Text Corpus, to being a Large Text Corpus, to being a Small Text Corpus.
- It can range from being a Monolingual Text Corpus (such as an English corpus), to being a Bilingual Text Corpus, to being a Multilingual Text Corpus.
- It can be an Annotated Text Corpus.
- It can be an input to a Text Corpus Mining Task.
- Example(s):
- Counter-Example(s):
- See: Corpora, Text-based Semantic Annotation, Text Stream.
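As a minimal sketch of the definition above, a text corpus can be represented as a collection of text items over which simple occurrence statistics are computed. The toy documents, the regex tokenizer, and the helper names (`tokenize`, `term_frequencies`, `document_frequency`) are illustrative assumptions, not part of any referenced corpus or tool.

```python
import re
from collections import Counter

# Hypothetical toy corpus: each string is one text item (document).
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird sang.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def document_frequency(documents, term):
    """Count how many documents contain the term at least once."""
    return sum(1 for doc in documents if term in tokenize(doc))


tf = term_frequencies(corpus)
print(tf.most_common(3))
print(document_frequency(corpus, "cat"))
```

Counting occurrences like this is the simplest kind of Text Corpus Mining Task; an Annotated Text Corpus would additionally attach labels to each text item or token.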
References
2015
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/text_corpus Retrieved:2015-4-13.
- In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/list_of_text_corpora Retrieved:2015-4-13.
- Following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Text corpora are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/list_of_text_corpora#English_language Retrieved:2015-4-13.
- Google N-Grams Corpus – Largest English corpus at 155 billion words. [1] Also has corpora for other languages. To download datasets of this corpus, see
- American National Corpus.
- Bank of English.
- British National Corpus.
- Corpus Juris Secundum.
- Corpus of Contemporary American English (COCA) 425 million words, 1990–2011. Freely searchable online.
- Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
- International Corpus of English.
- Oxford English Corpus.
- Scottish Corpus of Texts & Speech.
- Corpus Resource Database (CoRD), more than 80 English language corpora.
- [1] Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
2009
- (Yao et al., 2009) ⇒ Limin Yao, David Mimno, and Andrew McCallum. (2009). “Efficient Methods for Topic Model Inference on Streaming Document Collections.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557121
- QUOTE: Topic models provide a powerful tool for analyzing large text collections by ... Fitting a topic model given a set of training documents requires ... With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model.