Domain-Specific Document Corpus
A Domain-Specific Document Corpus is a document corpus whose documents are all drawn from a single, specific knowledge domain.
- AKA: Domain-Specific Corpus, Specialized Corpus, Field-Specific Corpus, Subject-Specific Corpus, Vertical Domain Corpus, Topical Corpus.
- Context:
- It can typically contain Domain-Specific Terminology through technical vocabulary and specialized jargon.
- It can typically support Domain-Specific NLP Tasks through specialized language models and domain-tuned algorithms.
- It can typically enable Domain Knowledge Extraction through specialized pattern recognition and domain concept mining.
- It can typically facilitate Domain-Specific Research through targeted text analysis and specialized corpus query.
- It can typically provide Domain Language Model Training through specialized training data and domain feature extraction.
- ...
- It can often require Domain Expert Annotation through specialized tagging schemes and domain-specific guidelines.
- It can often undergo Domain-Specific Preprocessing through specialized tokenization and domain text normalization.
- It can often support Domain Ontology Development through concept extraction processes and relation identification.
- It can often enable Domain-Specific Benchmarks through specialized evaluation metrics and domain performance measures.
- ...
- It can range from being a Narrow Domain-Specific Document Corpus to being a Broad Domain-Specific Document Corpus, depending on its domain scope coverage.
- It can range from being a Single-Source Domain-Specific Document Corpus to being a Multi-Source Domain-Specific Document Corpus, depending on its domain data source diversity.
- It can range from being a Historical Domain-Specific Document Corpus to being a Contemporary Domain-Specific Document Corpus, depending on its domain temporal coverage.
- It can range from being a Regional Domain-Specific Document Corpus to being a Global Domain-Specific Document Corpus, depending on its domain geographic scope.
- It can range from being a Homogeneous Domain-Specific Document Corpus to being a Heterogeneous Domain-Specific Document Corpus, depending on its domain text type variety.
- It can range from being a Static Domain-Specific Document Corpus to being a Growing Domain-Specific Document Corpus, depending on its domain content update policy.
- ...
- It can be developed through a Domain Corpus Creation Task using domain document collection and domain text curation.
- It can be annotated through a Domain-Specific Annotation Task using domain annotation tools and domain tagging protocols.
- It can be validated through a Domain Corpus Validation Task using domain coverage metrics and domain quality measures.
- It can be maintained through a Domain Corpus Management Task using domain update procedures and domain quality control.
- ...
- Example(s):
- Legal Domain-Specific Document Corpuses, such as:
- Medical Domain-Specific Document Corpuses, such as:
- Financial Domain-Specific Document Corpuses, such as:
- Scientific Domain-Specific Document Corpuses, such as:
- Computer Science Corpuses, such as:
- Physics Research Corpuses, such as:
- Technical Domain-Specific Document Corpuses, such as:
- ...
- Counter-Example(s):
- Open-Domain Corpus, such as Common Crawl, which lacks domain focus.
- General-Purpose Corpus, such as British National Corpus, which covers multiple domains without specialization.
- Domain-Specific Knowledge Base, which contains structured domain knowledge rather than raw domain text.
- Domain Dictionary, which provides term definitions rather than document collections.
- Random Web Sample, which lacks domain coherence and topical focus.
- See: Domain-Specific NLP, Domain Adaptation, Specialized Language Model, Domain Ontology, Technical Text Analysis, Domain Knowledge Extraction, Vertical Search Engine, Specialized Information Retrieval, Domain-Specific Benchmark, Professional Language Corpus.
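The domain coverage metrics mentioned under the Domain Corpus Validation Task can be illustrated with a minimal sketch. It assumes a hand-built list of single-word domain terms and a naive lowercase tokenizer; the function names `domain_term_coverage` and `domain_term_density`, the sample documents, and the term list are all hypothetical and are not taken from any particular corpus or toolkit.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive lowercase tokenizer (an assumption); real domain corpora
    often need specialized tokenization."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def domain_term_coverage(documents, domain_terms):
    """Fraction of documents that contain at least one domain term."""
    terms = {t.lower() for t in domain_terms}
    if not documents:
        return 0.0
    hits = sum(1 for doc in documents if terms & set(tokenize(doc)))
    return hits / len(documents)


def domain_term_density(documents, domain_terms):
    """Share of all tokens in the corpus that are domain terms."""
    terms = {t.lower() for t in domain_terms}
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    total = sum(counts.values())
    return sum(counts[t] for t in terms) / total if total else 0.0


# Hypothetical sample: two legal-sounding documents and one off-domain one.
docs = [
    "The plaintiff filed a tort claim.",
    "The weather is nice today.",
    "The court dismissed the plaintiff.",
]
legal_terms = {"plaintiff", "tort", "court"}

coverage = domain_term_coverage(docs, legal_terms)
density = domain_term_density(docs, legal_terms)
print(f"coverage={coverage:.2f} density={density:.2f}")
```

A low coverage value would suggest that off-domain documents have leaked into the corpus, while density gives a rough sense of how terminology-heavy the text is. Both measures are only as good as the term list, and multi-word terms would need phrase matching that this sketch omits.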
References
2010
- (Melli, 2010a) ⇒ Gabor Melli. (2010). “Concept Mentions within KDD-2009 Abstracts (kdd09cma1) Linked to a KDD Ontology (kddo1).” In: Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010).
- QUOTE: The kdd09cma1 corpus is based on the 139 abstracts of the papers accepted for ACM's SIGKDD annual conference in 2009 (KDD 2009) that are freely accessible from ACM's Digital Library [1]. KDD is a competitive peer-reviewed conference with acceptance rates in the range of 20%-25%. The conference topic is data mining and knowledge discovery from databases.
The abstracts were manually annotated by the author for concept mentions. We define a concept mention to be a sequence of tokens (orthographic words and punctuation) whose meaning is deemed by an expert to be used within their community of speakers, and whose meaning is not necessarily well understood by a member of the general public. Often concept mentions are words (terminological units), but not always. The mentions can also be phrases. For example the phrase “problem of web classification” could be identified as a mention of the Web_Object Classification_Task concept.