2005 ProminerRuleBasedProteinGeneNER

Subject Headings: ProMiner System, Protein Mention Recognition Task, Protein Mention Normalization Task.


Identification of gene and protein names in biomedical text is a challenging task as the corresponding nomenclature has evolved over time. This has led to multiple synonyms for individual genes and proteins, as well as names that may be ambiguous with other gene names or with general English words. The Gene List Task of the BioCreAtIvE challenge evaluation enables comparison of systems addressing the problem of protein and gene name identification on common benchmark data.


The ProMiner system uses a pre-processed synonym dictionary to identify potential name occurrences in the biomedical text and associate protein and gene database identifiers with the detected matches. It follows a rule-based approach and its search algorithm is geared towards recognition of multi-word names [1]. To account for the large number of ambiguous synonyms in the considered organisms, the system has been extended to use specific variants of the detection procedure for highly ambiguous and case-sensitive synonyms. Based on all detected synonyms for one abstract, the most plausible database identifiers are associated with the text. Organism specificity is addressed by a simple procedure based on additionally detected organism names in an abstract.
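
The dictionary lookup described above can be illustrated with a minimal sketch. It is not the ProMiner implementation: the greedy longest-match strategy, the `Entry` type, the function names and the identifiers `ID:0001` to `ID:0003` are all assumptions invented for illustration. The sketch shows only two ideas from the paragraph: multi-word synonyms are matched as token sequences, and synonyms flagged as case-sensitive must match the text exactly.

```python
import re
from typing import NamedTuple

class Entry(NamedTuple):
    surface: str          # synonym as written in the (hypothetical) dictionary
    ids: frozenset        # database identifiers that share this synonym
    case_sensitive: bool  # True for synonyms that collide with common words

TOKEN = re.compile(r"[A-Za-z]+|\d+")

def tokens(text):
    """Return (token, start, end) triples; letter and digit runs are split apart."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]

def build_index(entries):
    """Map each lower-cased token sequence to the dictionary entries it spells."""
    index = {}
    for e in entries:
        key = tuple(t.lower() for t, _, _ in tokens(e.surface))
        index.setdefault(key, []).append(e)
    return index

def find_mentions(text, index):
    """Greedy longest-match scan; returns (matched text, sorted identifiers)."""
    toks = tokens(text)
    max_len = max((len(k) for k in index), default=0)
    hits, i = [], 0
    while i < len(toks):
        matched = False
        for n in range(min(max_len, len(toks) - i), 0, -1):  # longest first
            window = toks[i:i + n]
            key = tuple(t.lower() for t, _, _ in window)
            ids = set()
            for e in index.get(key, []):
                exact = [t for t, _, _ in tokens(e.surface)] == [t for t, _, _ in window]
                if exact or not e.case_sensitive:
                    ids |= e.ids
            if ids:
                hits.append((text[window[0][1]:window[-1][2]], sorted(ids)))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits

# Invented toy dictionary; real synonym lists are derived from gene and protein databases.
ENTRIES = [
    Entry("interleukin 2", frozenset({"ID:0001"}), False),
    Entry("IL-2", frozenset({"ID:0001"}), False),
    Entry("interleukin 2 receptor", frozenset({"ID:0002"}), False),
    Entry("CAT", frozenset({"ID:0003"}), True),
]

if __name__ == "__main__":
    text = "The cat study measured IL2 and interleukin-2 receptor levels; CAT was unchanged."
    for mention, ids in find_mentions(text, build_index(ENTRIES)):
        print(mention, ids)
```

In this toy run the lower-case word "cat" is skipped because its entry is case-sensitive, while "IL2" and "interleukin-2 receptor" are matched despite spelling variation. Selecting the most plausible identifier per abstract and filtering by detected organism names, which the paragraph also mentions, are left out of the sketch.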


The extended ProMiner system has been applied to the test cases of the BioCreAtIvE competition with highly encouraging results. In blind predictions, the system achieved an F-measure of approximately 0.8 for the organisms mouse and fly and about 0.9 for the organism yeast.
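
For readers unfamiliar with the metric, the sketch below computes an F-measure from counts of correct, spurious and missed identifiers. The abstract does not state which weighting was used, so the balanced form (beta = 1) is an assumption here, and the counts in the usage line are invented for illustration rather than taken from the evaluation.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives and false negatives.

    beta=1.0 gives the balanced harmonic mean of precision and recall
    (an assumption; the source only says "F-measure").
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

if __name__ == "__main__":
    # Hypothetical counts: 8 correct identifiers, 2 spurious, 2 missed.
    print(round(f_measure(8, 2, 2), 3))
```

With these invented counts, precision and recall are both 0.8, so the balanced F-measure is also 0.8.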


  • Hanisch D, Fluck J, Mevissen HT, Zimmer R: Playing biology's name game: identifying protein names in scientific text. Pacific Symposium on Biocomputing 2003, 403-14.
  • Jenssen T, Lagreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics 2001, 28:21.
  • Fukada K, Tamura A, Tsunoda T, Takagi T: Toward information extraction: identifying protein names from biological papers. Pacific Symposium on Biocomputing 1998, 701.
  • Proux D, Rechenmann F, Julliard L, Pillet V, Jacq B: Detecting Gene Symbols and Names in Biological Texts: a first step toward pertinent information extraction. Genome Informatics Workshop 1998, 72-80.
  • Collier N, Nobata C, Tsujii J: Extracting the names of genes and gene products with a Hidden Markov Model. Proc COLING 2000, 201-207.
  • Lee KJ, Hwang YS, Rim HC: Two-Phase Biomedical NE Recognition based on SVMs. In: Proceedings of the ACL 2003 Workshop on Natural Language
  • Wilbur WJ, Tanabe L: GENETAG: A tagged Corpus for Gene/Protein Named Entity Recognition. BMC Bioinformatics 2005, 6(Suppl 1):S3.
  • Yeh A, Morgan A, Colosimo ME, Hirschman L: BioCreAtIvE task 1a: gene mention finding evaluation. BMC Bioinformatics 2005, 6(Suppl 1):S2.
  • Hirschman L, Colosimo ME, Morgan A, Yeh A: Overview of BioCreAtIvE task 1B: Normalized Gene Lists.
  • Krauthammer M, Rzhetsky A, Morozov P, Friedmann C: Using BLAST for identifying gene and protein names in journal articles. Gene 2000, 259:245.
  • Fundel K, Güttler D, Zimmer R, Apostolakis J: A simple approach for protein name identification: prospects and limits. BMC Bioinformatics 2005, 6(Suppl 1):S15.
  • Porter M: An algorithm for suffix stripping. Program 1980, 14(3):130-137.
  • Chang J, Schütze H, Altman R: Creating an Online Dictionary of Abbreviations from MEDLINE. The Journal of the American Medical Informatics Association 2002, 9(6):612-620.
  • Schwartz AS, Hearst MA: Identifying Abbreviation Definitions in Biomedical Text. Pacific Symposium on Biocomputing 2003, 451-462.
  • Tamames J: Text Detective: BioAlma's gene annotation tool. BMC Bioinformatics 2005, 6(Suppl 1):S10.
  • Crim J, McDonald R, Pereira F: Automatically Annotating Documents with Normalized Gene Lists. BMC Bioinformatics 2005, 6(Suppl 1):S13.
  • Colosimo ME, Morgan A, Yeh A, Colombe J, Hirschman L: Data Preparation and Interannotator Agreement: BioCreAtIvE Task 1B. BMC Bioinformatics 2005, 6(Suppl 1):S12.


2005 ProminerRuleBasedProteinGeneNER
Author: Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, Juliane Fluck
Title: ProMiner: Rule-based protein and gene entity recognition
Journal: BMC Bioinformatics
URL: http://www.biomedcentral.com/1471-2105/6/S1/S14
Year: 2005