2005 SimpleApproachToProteinNER

Subject Headings: ProMiner System, Protein Named Entity Recognition Task


Cited By

~50 http://scholar.google.com/scholar?cites=12724719570762652811



  • Background: Significant parts of biological knowledge are available only as unstructured text in articles of biomedical journals. By automatically identifying gene and gene product (protein) names and mapping these to unique database identifiers, it becomes possible to extract and integrate information from articles and various data sources. We present a simple and efficient approach that identifies gene and protein names in texts and returns database identifiers for matches. It has been evaluated in the recent BioCreAtIvE entity extraction and mention normalization task by an independent jury.
  • Methods: Our approach is based on the use of synonym lists that map the unique database identifiers for each gene/protein to the different synonym names. For yeast and mouse, synonym lists were used as provided by the organizers, who generated them from public model organism databases. The synonym list for fly was generated directly from the corresponding organism database. The lists were then extensively curated in a largely automated procedure and matched against MEDLINE abstracts by exact text matching. Rule-based and support vector machine-based post filters were designed and applied to improve precision.
  • Results: Our procedure showed high recall and precision with F-measures of 0.897 for yeast and 0.764/0.773 for mouse in the BioCreAtIvE assessment (Task 1B) and 0.768 for fly in a postevaluation.
  • Conclusion: The results were close to the best over all submissions. Depending on the synonym properties it can be crucial to consider context and to filter out erroneous matches. This is especially important for fly, which has a very challenging nomenclature for the protein name identification task. Here, the support vector machine-based post filter proved to be very effective.
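
The matching step described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy synonym list with made-up identifiers and names, matches case-insensitively (the abstract only says "exact text matching"), and omits the curation and post-filtering steps. The names `SYNONYMS`, `build_index` and `find_matches` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical toy synonym list: database identifier -> synonym names.
# Identifiers and names are illustrative placeholders, not from the paper.
SYNONYMS = {
    "ID:0001": ["abc1", "alpha binding complex 1"],
    "ID:0002": ["xyz2", "xyz-2"],
}


def build_index(synonyms):
    """Invert the synonym list: lower-cased synonym -> set of identifiers."""
    index = defaultdict(set)
    for identifier, names in synonyms.items():
        for name in names:
            index[name.lower()].add(identifier)
    return index


def find_matches(text, index):
    """Return (start, end, matched text, identifiers) for each synonym hit.

    Longer synonyms are tried first, and a hit must not be directly
    preceded or followed by a letter or digit, so a short synonym does
    not match inside a longer token.
    """
    names = sorted(index, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,  # assumption: case is ignored when matching
    )
    return [
        (m.start(), m.end(), m.group(0), sorted(index[m.group(0).lower()]))
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    index = build_index(SYNONYMS)
    for hit in find_matches("XYZ-2 binds Abc1", index):
        print(hit)
```

A synonym shared by several genes would return more than one identifier here. Resolving such ambiguous or erroneous hits is the role the abstract assigns to the rule-based and support vector machine-based post filters, which this sketch does not attempt.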


  • Bunescu R, Ge R, Kate R, Mooney R, Wong Y, Marcotte E, Ramani A: Learning to Extract Proteins and their Interactions from Medline Abstracts. Proceedings of ICML-2003 Workshop on Machine Learning in Bioinformatics 2003:46-53.
  • Chang JT, Schutze H, Altman RB: GAPSCORE: finding gene and protein names one word at a time. Bioinformatics 2004, 20(2):216-225.
  • Kazama J, Makino T, Ohta Y, Tsujii J: Tuning Support Vector Machines for Biomedical Named Entity Recognition. Proceedings of the Natural Language Processing in the Biomedical Domain (ACL 2002) 2002:1-8.
  • Takeuchi K, Collier N: Bio-Medical Entity Extraction using Support Vector Machines. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine 2003:57-64.
  • Tanabe L, Wilbur WJ: Tagging gene and protein names in biomedical text. Bioinformatics 2002, 18(8):1124-1132.
  • Hanisch D, Fluck J, Mevissen H, Zimmer R: Playing Biology's Name Game: Identifying Protein Names in Scientific Text. Pacific Symposium on Biocomputing 2003, 8:403-414.
  • Koike A, Takagi T: Gene/Protein/Family Name Recognition in Biomedical Literature. Proceedings of BioLink 2004 Workshop: Linking Biological Literature, Ontologies and Databases: Tools for Users 2004.
  • Ono T, Hishigaki H, Tanigami A, Takagi T: Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics 2001, 17(2):155-161.
  • Tsuruoka Y, Tsujii J: Boosting Precision and Recall of Dictionary Based Protein Name Recognition. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine 2003:41-48.
  • Hirschman L, Morgan AA, Yeh AS: Rutabaga by any other name: extracting biological names. Journal of Biomedical Informatics 2002, 35(4):247-259.
  • Hirschman L, Colosimo ME, Morgan AA, Yeh AS: Overview of BioCreAtIvE task 1B: Normalized Gene Lists. BMC Bioinformatics 2005, 6(Suppl 1):S11.
  • Hanisch D, Fundel K, Mevissen H, Zimmer R, Fluck J: ProMiner: Rule-based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S14.
  • Dolinski K, Balakrishnan R, Christie KR, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hong EL, Issel-Tarver L, Sethuraman A, Theesfeld CL, Binkley G, Lane C, Schroeder M, Dong S, Weng S, Andrada R, Botstein D, Cherry JM: Saccharomyces Genome Database. [1].
  • Blake J, Richardson J, Bult C, Kadin J, Eppig J, the members of the Mouse Genome Database Group: MGD: The Mouse Genome Database. Nucleic Acids Res 2003, 31:193-195 ics.jax.org/.
  • The FlyBase Consortium: The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 2003, 31:172-175 [2].
  • Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin M, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31:365-370 http://www.expasy.org/sprot/sprot-top.html.
  • Wain HM, Lush MJ, Ducluzeau F, Khodiyar VK, Povey S: Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res 2004, 32(90001):D255-257 [3].
  • Chang CC, Lin CJ: LIBSVM: a library for support vector machines. 2001.
  • Brill E: A simple rule-based part of speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy 1992.


Author: Katrin Fundel, Daniel Güttler, Ralf Zimmer, Joannis Apostolakis
Title: A Simple Approach for Protein Name Identification: prospects and limits
Journal: BMC Bioinformatics
URL: http://www.biomedcentral.com/content/pdf/1471-2105-6-S1-S15.pdf
Year: 2005