Document Corpus Database

A Document Corpus Database is a stable, digital, searchable document database that is used to answer a research question.

AKA: Corpus Database, Document Set, Document Corpus, Corpus, Document Collection.
Context:
- It can typically contain Corpus Words, Corpus Sentences, and Corpus Documents from the corpus document collection.
- It can typically support Corpus-Based Tasks through corpus analysis functions and corpus search capability.
- It can typically provide Statistical Analysis Capability through corpus statistics and frequency distributions.
- It can typically enable Linguistic Research Applications through corpus annotation layers and corpus metadata.
- It can typically facilitate Machine Learning Training Processes through training data extraction and feature engineering.
- ...
- It can often be a member of a Corpora Collection that aggregates multiple related corpus databases.
- It can often undergo Corpus Annotation Processes to add linguistic tags and semantic markup.
- It can often require Corpus Management Systems for corpus administration and corpus maintenance.
- It can often support Cross-Linguistic Analysis through parallel corpus alignment and comparable corpus mapping.
- ...
- It can range from being a Small Document Corpus Database to being a Large Document Corpus Database, depending on its document corpus size.
- It can range from being an Unannotated Document Corpus Database to being an Annotated Document Corpus Database, depending on its document annotation level.
- It can range from being a Real-World Document Corpus Database to being a Synthetic Document Corpus Database, depending on its document data source.
- It can range from being a Domain-Specific Document Corpus Database to being an Open-Domain Document Corpus Database, depending on its document content scope.
- It can range from being a Monolingual Document Corpus Database to being a Multilingual Document Corpus Database, depending on its document language diversity.
- It can range from being a Sequential Document Corpus Database to being a Non-Sequential Document Corpus Database, depending on its document temporal ordering.
- It can range from being a Static Document Corpus Database to being a Dynamic Document Corpus Database, depending on its document update frequency.
- ...
- It can be created by a Corpus Creation Task through document collection processes and corpus compilation methods.
- It can be managed by a Corpus Management Task through corpus maintenance procedures and corpus quality control.
- It can be analyzed by a Corpus Analysis Task through corpus mining techniques and linguistic analysis tools.
- It can be evaluated by a Corpus Evaluation Task through corpus quality metrics and corpus coverage measures.
- ...
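The search capability and frequency distributions named in the context items above can be illustrated with a minimal sketch. The class name `TinyCorpus`, the helper `tokenize`, and the three sample documents are hypothetical and chosen only for illustration; a real corpus database would use a proper tokenizer and persistent storage.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyCorpus:
    """Hypothetical in-memory corpus with term frequencies and an inverted index."""

    def __init__(self, documents):
        self.documents = dict(documents)   # doc_id -> raw text
        self.term_freq = Counter()         # term -> count over the whole corpus
        self.index = defaultdict(set)      # term -> ids of documents containing it
        for doc_id, text in self.documents.items():
            tokens = tokenize(text)
            self.term_freq.update(tokens)
            for token in tokens:
                self.index[token].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(hits)


# Sample documents invented for this sketch.
corpus = TinyCorpus({
    "d1": "The contract binds both parties.",
    "d2": "Both parties signed the contract on Monday.",
    "d3": "Gene mentions appear in biomedical abstracts.",
})

print(corpus.search("contract parties"))
print(corpus.term_freq.most_common(3))
```

The inverted index keeps lookups proportional to the number of query terms rather than the number of documents, which is one common design choice; other corpus systems may use suffix arrays or full-text search engines instead.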
Example(s):
- Text-Focused Document Corpus Databases, such as:
  - Natural Language Processing Corpuses, such as:
  - Web-Based Text Corpuses, such as:
    - Common Crawl Corpus for web-scale text analysis.
    - Wikipedia Corpus for encyclopedic text research.
- Domain-Specific Document Corpus Databases, such as:
  - Legal Document Corpus Databases, such as:
    - ContractNLI Corpus by Koreeda & Manning, for contract inference tasks.
    - CUAD Dataset for contract understanding research.
  - Biomedical Document Corpus Databases, such as:
    - PubMed Corpus for medical literature analysis.
    - GENETAG Corpus for gene mention recognition.
  - News Document Corpus Databases, such as:
    - Reuters Corpus for news classification tasks.
    - AQUAINT Corpus for news-based question answering.
- Multilingual Document Corpus Databases, such as:
  - Parallel Translation Corpuses, such as:
    - Europarl Corpus for parliamentary translations.
    - UN Parallel Corpus for multilingual alignment.
- Annotated Document Corpus Databases, such as:
  - SemCor Corpus for word sense disambiguation.
  - OntoNotes Corpus for multilayer linguistic annotation.
  - ACE Corpus for entity relation extraction.
- Benchmark Document Corpus Databases, such as:
- ...
Counter-Example(s):
- The Library of Congress, which is primarily a physical document collection rather than a digital corpus database.
- Knowledge Base, which contains structured knowledge representations rather than raw document collections.
- Random Document Sample, which lacks the systematic organization and research focus of a corpus database.
- Scientific Literature Collection, which may lack the digital accessibility and computational processability of a corpus database.
- Database Management System, which manages structured data records rather than document collections.
See: Corpus-Based Application, Corpus Linguistics, Text Mining System, Natural Language Processing, Information Retrieval System, Digital Library, Document Management System, Text Analytics Platform, Computational Linguistics, Machine Learning Dataset.

References

2011

(Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Text_corpus
- QUOTE: In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules on a specific universe.
  A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.
  In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
  Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around 1 to 3 million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.
  Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part of speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching.
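The annotation described in the quote above (part-of-speech tags and lemma forms attached to each word) can be sketched as a simple token record. This is only an illustrative representation: the `Token` class, the helper functions, and the hand-assigned tags are assumptions, with tag names written in the style of the Universal Dependencies tag set rather than taken from any specific corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word: surface form, part-of-speech tag, and lemma."""
    form: str
    pos: str
    lemma: str


# Hand-annotated example sentence; the tags are illustrative assumptions.
sentence = [
    Token("He", "PRON", "he"),
    Token("edited", "VERB", "edit"),
    Token("the", "DET", "the"),
    Token("Hemingway", "PROPN", "Hemingway"),
    Token("corpus", "NOUN", "corpus"),
]


def pos_counts(tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(token.pos for token in tokens)


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given tag, in order."""
    return [token.lemma for token in tokens if token.pos == pos]


print(pos_counts(sentence))
print(lemmas_with_pos(sentence, "VERB"))
```

Storing the lemma next to the surface form lets a query for "edit" also match "edited", which is one reason lemma annotation is added to corpora.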

2009

(Yao et al., 2009) ⇒ Limin Yao, David Mimno, and Andrew McCallum. (2009). “Efficient Methods for Topic Model Inference on Streaming Document Collections.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557121
- QUOTE: … With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model.

2009

(WordNet, 2009) ⇒ http://wordnetweb.princeton.edu/perl/webwn?s=corpus
- S: (n) principal, corpus, principal sum (capital as contrasted with the income derived from it)
- S: (n) corpus (a collection of writings) "he edited the Hemingway corpus"
- …
http://en.wiktionary.org/wiki/corpus#Noun
- a collection of writings, often on a specific topic, of a specific genre, from a specific demographic, a single author etc. …
