2006 AutomaticAssignmentOfBioCateg


Subject Headings: Text Classification Task, Large Multiclass Classification Task, MeSH, GO Ontology

Notes

Cited By

Quotes

Motivation

We report on the development of a generic text categorization system designed to automatically assign biomedical categories to any input text. Unlike usual automatic text categorization systems, which rely on data-intensive models extracted from large sets of training data, our categorizer is largely data-independent.

Methods

In order to evaluate the robustness of our approach we test the system on two different biomedical terminologies: the Medical Subject Headings (MeSH) and the Gene Ontology (GO). Our lightweight categorizer, based on two ranking modules, combines a pattern matcher and a vector space retrieval engine, and uses both stems and linguistically-motivated indexing units.
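
The combination of a pattern matcher with a vector space engine can be illustrated with a minimal sketch. This is not the author's implementation: the function names, the binary phrase match, the raw term-frequency cosine, and the linear mixing parameter `weight` are all assumptions made for illustration, and stemming and phrase indexing are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (no stemming in this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pattern_score(text, term):
    """Return 1.0 if the whole term occurs as a phrase in the text, else 0.0."""
    tokens = tokenize(term)
    if not tokens:
        return 0.0
    pattern = r"\b" + r"\s+".join(re.escape(t) for t in tokens) + r"\b"
    return 1.0 if re.search(pattern, text.lower()) else 0.0


def cosine_score(text, term):
    """Cosine similarity between raw term-frequency vectors of text and term."""
    a, b = Counter(tokenize(text)), Counter(tokenize(term))
    dot = sum(a[t] * b[t] for t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_categories(text, categories, weight=0.5):
    """Rank categories by a linear mix of both scores (weight is hypothetical)."""
    scored = [
        (weight * pattern_score(text, c) + (1 - weight) * cosine_score(text, c), c)
        for c in categories
    ]
    # Highest score first; ties broken alphabetically for determinism.
    return [c for _, c in sorted(scored, key=lambda p: (-p[0], p[1]))]


text = "Mutations in the tumor suppressor protein p53 alter cell cycle regulation."
categories = ["Cell Cycle", "Tumor Suppressor Protein p53", "Apoptosis"]
ranking = rank_categories(text, categories)
```

Because the only resource used is the list of category terms, a sketch of this kind needs no training corpus, which is the sense in which the abstract calls the categorizer largely data-independent.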

Results and Conclusion

Results show the effectiveness of phrase indexing for both GO and MeSH categorization, but we observe the categorization power of the tool depends on the controlled vocabulary: precision at high ranks ranges from above 90% for MeSH to 20% for GO, establishing a new baseline for categorizers based on retrieval methods.
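
The excerpt does not define "precision at high ranks". A common reading is precision at k, the fraction of the top k ranked categories that are correct. The sketch below assumes that conventional definition; the function name and the toy data are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked categories that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked[:k]
    return sum(1 for category in top if category in relevant) / k


# Toy example: a ranked list of category labels and the gold-standard set.
ranked = ["A", "B", "C", "D"]
relevant = {"A", "C"}
p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_2 = precision_at_k(ranked, relevant, 2)
```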

2 Background

To our knowledge the largest set of categories ever used by text classification systems has an order of magnitude of 10^4. Thus, Yang and Chute (1992) work with the International Classification of Diseases (about 12,000 concepts), while Yang (1999) and Wilbur and Yang (1996) report on experiments conducted with a search space of 18,000 Medical Subject Headings (MeSH). To evaluate our system, it is tested using two different benchmarks: 1) the OHSUGEN (Hersh, 2005) collection for the MeSH terminology and 2) the BioCreative data for the Gene Ontology (GO). The Gene Ontology is currently the main controlled vocabulary for molecular biology. The MeSH is a more general glossary as it also covers medical and clinical fields, but it has been acknowledged as an important resource for text mining in the domain (Shah et al., 2003).

2.1 Scalability issues

General purpose machine learning methods might be inappropriate for some automatic text categorization tasks in biomedical terminologies because reliable training data are often not available (Camon et al., 2003). To some extent, this statement can be applied to the MeSH as well: between 2004 and 2005, 487 new headings were introduced, while 60 were deleted and 129 were modified, so about two concepts are added every day. In contrast, our approach is data-poor, because it only demands a small collection of annotated texts for fine tuning the statistical model.
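
As a quick check of the rate quoted above, the figures can be divided over a year. The 365-day period is an assumption, as is the choice to count additions, deletions and modifications together.

```python
# MeSH changes between 2004 and 2005, as quoted in the text.
added, deleted, modified = 487, 60, 129
days = 365  # assumption: the changes are spread over one calendar year

additions_per_day = added / days
changes_per_day = (added + deleted + modified) / days

print(f"additions per day: {additions_per_day:.2f}")
print(f"all changes per day: {changes_per_day:.2f}")
```

Under these assumptions, "about two" per day is closer to the total number of changed headings than to the additions alone.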



2006 AutomaticAssignmentOfBioCateg
Author: Patrick Ruch
Title: Automatic Assignment of Biomedical Categories: toward a generic approach
URL: http://dx.doi.org/10.1093/bioinformatics/bti783
DOI: 10.1093/bioinformatics/bti783