A text classification task is a classification task that is restricted to mapping text items into one or more predefined text categories.
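The mapping "into one or more predefined categories" can be illustrated with a deliberately simple sketch. The category names, keyword sets, and the `classify` function below are hypothetical and chosen only for illustration; real systems typically learn the mapping from labeled data rather than from hand-written keyword lists.

```python
import re

# Hypothetical predefined categories, each with illustrative trigger words.
CATEGORY_KEYWORDS = {
    "sports": {"match", "goal", "team"},
    "finance": {"stock", "market", "bank"},
    "health": {"doctor", "vaccine", "clinic"},
}

def classify(text):
    """Return the sorted list of predefined categories whose keywords occur in text.

    The result may hold zero, one, or several categories (a multi-label mapping).
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, kws in CATEGORY_KEYWORDS.items() if tokens & kws)

print(classify("The team scored a late goal"))
print(classify("Bank stock fell after the vaccine news"))
```

The first call should yield a single category and the second call two, showing that one text item may map to several categories at once.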
References
2009
- (Wikipedia, 2009) http://en.wikipedia.org/wiki/Document_classification
- Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. Document classification tasks can be divided into two sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification, where the classification must be done entirely without reference to external information. There is also a semi-supervised document classification, where parts of the documents are labeled by the external mechanism.
2008
- (Manning & al, 2008) => Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. (2008). "Introduction to Information Retrieval." Cambridge University Press. ISBN 0521865719.
- The text classification problem http://nlp.stanford.edu/IR-book/html/htmledition/the-text-classification-problem-1.html
- In text classification, we are given a description d ∈ X of a document, where X is the document space; and a fixed set of classes C = {c1, c2, ..., cJ}. Classes are also called categories or labels. Typically, the document space X is some type of high-dimensional space, and the classes are human defined for the needs of an application... We are given a training set D of labeled documents <d,c>, where <d,c> ∈ X × C.
- Our goal in text classification is high accuracy on test data or new data ... When we use the training set to learn a classifier for test data, we make the assumption that training data and test data are similar or from the same distribution.
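The setup quoted above (a training set D of labeled documents, a fixed class set C, and a learned classifier applied to new data) can be sketched in a few lines. This is a minimal illustration, assuming a multinomial Naive Bayes learner with add-one smoothing as the learning method; the quoted passage does not prescribe a particular learner, and the names `tokenize`, `train`, `classify`, and the toy training set are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_docs):
    """Collect per-class document and term counts from (document, class) pairs."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for doc, label in labeled_docs:
        class_doc_counts[label] += 1
        for tok in tokenize(doc):
            term_counts[label][tok] += 1
            vocabulary.add(tok)
    return class_doc_counts, term_counts, vocabulary

def classify(model, doc):
    """Return the class with the highest log-probability score for doc."""
    class_doc_counts, term_counts, vocabulary = model
    n_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_doc_counts):
        # Log prior: fraction of training documents carrying this class.
        score = math.log(class_doc_counts[label] / n_docs)
        # Denominator for add-one (Laplace) smoothed term probabilities.
        total = sum(term_counts[label].values()) + len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:  # terms unseen in training are ignored
                score += math.log((term_counts[label][tok] + 1) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set D; the documents and the two classes are made up.
D = [
    ("cheap pills buy now", "spam"),
    ("limited offer buy cheap", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
]
model = train(D)
print(classify(model, "buy cheap pills"))
print(classify(model, "monday project meeting"))
```

The sketch relies on the same assumption the passage states: the new documents resemble the training documents, so term statistics gathered from D are informative for unseen text.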
2006
- (Ruch, 2006) => Patrick Ruch. (2006). "Automatic Assignment of Biomedical Categories: toward a generic approach." In: Bioinformatics, 2006 Mar 15. [doi>10.1093/bioinformatics/bti783].
- To our knowledge the largest set of categories ever used by text classification systems has an order of magnitude of 10^4. Thus, Yang and Chute (1992) work with the International Classification of Diseases (about 12,000 concepts), while Yang (1999) and Wilbur and Yang (1996) report on experiments conducted with a search space of less than 18,000 Medical Subject Headings (MeSH). To evaluate our system, it is tested using two different benchmarks: 1) the OHSUGEN (Hersh, 2005) collection for the MeSH terminology and 2) the BioCreative data for the Gene Ontology (GO). The Gene Ontology is currently the main controlled-vocabulary for molecular biology. The MeSH is a more general glossary as it covers also medical and clinical fields, but it has been acknowledged as an important resource for text mining in the domain (Shah et al., 2003).
2002
- (Sebastiani, 2002) => Fabrizio Sebastiani. (2002). "Machine Learning in Automated Text Categorization." In: ACM Computing Surveys (CSUR), 34(1).
- The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories.
1999
- (Yang, 1999) => Y. Yang. (1999). "An Evaluation of Statistical Approaches to Text Categorization." In: Journal of Information Retrieval, 1.
- Reports experiments on a search space of fewer than 18,000 Medical Subject Headings (MeSH).
1998
- (Dumais & al, 1998) => Susan T. Dumais, John C. Platt, David Heckerman, and Mehran Sahami. (1998). "Inductive Learning Algorithms and Representations for Text Categorization." In: Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM 1998).
- Text categorization – the assignment of natural language texts to one or more predefined categories based on their content – is an important component in many information organization and management tasks. Its most widespread application to date has been for assigning subject categories to documents to support text retrieval, routing and filtering.
1996
- (Wilbur & Yang, 1996) => J. Wilbur, and Y. Yang. (1996). "Analysis of Statistical Term Strength and its Use in the Indexing and Retrieval of Molecular Biology Texts." In: Comput. Biol. Med., 26(3), 209–222.
- Reports experiments on a search space of fewer than 18,000 Medical Subject Headings (MeSH).
1992
- (Yang & Chute, 1992) => Y. Yang, and C. Chute. (1992). "A Linear Least Squares Fit Mapping Method for Information Retrieval from Natural Language Texts." In: COLING 1992.
- Works with the International Classification of Diseases (about 12,000 concepts).
1975
- (Field, 1975) => B. J. Field. (1975). "Towards Automatic Indexing: Automatic assignment of controlled-language indexing and classification from free indexing." In: Journal of Documentation, 31(4). [doi>10.1108/eb026605]
1963
- (Borko & Bernick, 1963) => Harold Borko, and Myrna Bernick. (1963). "Automatic Document Classification." In: Journal of the ACM (JACM).
- The problem of automatic document classification is a part of the larger problem of automatic content analysis. Classification means the determination of subject content. For a document to be classified under a given heading, it must be ascertained that its subject matter relates to that area of discourse. In most cases this is a relatively easy decision for a human being to make. The question being raised is whether a computer can be programmed to determine the subject content of a document and the category (categories) into which it should be classified.