2007 RuleBasedProteinTermIdentification


Subject Headings: Organism Component Semantic Relation Recognition Task, ITI TXM Corpora, Organism Mention Normalization Task, Organism NER.


Cited By




In biomedical articles, terms often refer to different protein entities. For example, an arbitrary occurrence of the term p53 might denote thousands of proteins across a number of species. A human annotator is able to resolve this ambiguity relatively easily, by looking at its context and, if necessary, by searching an appropriate protein database. However, this phenomenon may cause much trouble for a text mining system, which does not understand human languages and hence cannot identify the correct protein that the term refers to. In this paper, we present a Term Identification system which automatically assigns unique identifiers, as found in a protein database, to ambiguous protein mentions in texts. Unlike other solutions described in the literature, which only work on gene/protein mentions for a specific model organism, our system is able to tackle protein mentions across many species, by integrating a machine-learning-based species tagger. We have compared the performance of our automatic system to that of human annotators, with very promising results.
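The idea of disambiguating a protein mention by first guessing its species can be illustrated with a minimal sketch. This is not the paper's system: the lexicon, its placeholder identifiers, the cue-word lists, and the function names `normalize`, `tag_species`, and `identify` are all hypothetical, and a simple cue-word count stands in for the machine-learning-based species tagger mentioned in the abstract.

```python
from typing import Optional

# Hypothetical toy lexicon: (normalized term, species) -> placeholder identifier.
# The identifiers are invented for illustration; they are not real database accessions.
LEXICON = {
    ("p53", "human"): "PLACEHOLDER_ID_HUMAN_P53",
    ("p53", "mouse"): "PLACEHOLDER_ID_MOUSE_P53",
}

# Hypothetical cue words; a crude stand-in for a trained species tagger.
SPECIES_CUES = {
    "human": {"human", "patient", "patients"},
    "mouse": {"mouse", "mice", "murine"},
}


def normalize(term: str) -> str:
    """Lowercase and keep only letters and digits, so 'P-53' matches 'p53'."""
    return "".join(ch for ch in term.lower() if ch.isalnum())


def tag_species(context: str) -> Optional[str]:
    """Return the species whose cue words occur most often, or None if no cue occurs.

    Ties are resolved by the order of SPECIES_CUES.
    """
    tokens = [tok.strip(".,;:()") for tok in context.lower().split()]
    counts = {
        species: sum(tok in cues for tok in tokens)
        for species, cues in SPECIES_CUES.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def identify(term: str, context: str) -> Optional[str]:
    """Map an ambiguous mention to an identifier using the species guessed from context."""
    species = tag_species(context)
    if species is None:
        return None
    return LEXICON.get((normalize(term), species))
```

With this toy data, `identify("p53", "Expression of p53 in murine fibroblasts.")` resolves to the mouse placeholder entry, while a context with no species cue yields `None` rather than a guess.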


  • 1. Michael Krauthammer, Nenadic, G.: Term identification in the biomedical literature. Journal of Biomedical Informatics (Special Issue on Named Entity Recognition in Biomedicine) 37(6) (2004) 512–526
  • 2. Lynette Hirschman, Morgan, A.A., Yeh, A.S.: Rutabaga by any other name: extracting biological names. J Biomed Inform 35(4) (2002) 247–259
  • 3. Tuason, O., Chen, L., Liu, H., Blake, J.A., Friedman, C.: Biological nomenclature: A source of lexical knowledge and ambiguity. In: Proceedings of Pac Symp Biocomput. (2004). 238–249
  • 4. Nenadic, G., Ananiadou, S., McNaught, J.: Enhancing automatic term recognition through term variation. In: Proceedings of 20th International Conference on Computational Linguistics (Coling 2004), Geneva, Switzerland (2004)
  • 5. Chen, L., Liu, H., Friedman, C.: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics (2005) 248–256
  • 6. (Fang et al., 2006) ⇒ Haw-ren Fang, Kevin P. Murphy, Yang Jin, Jessica S. Kim, and Peter S. White. (2006). “Human Gene Name Normalization Using Text Matching with Automatically Extracted Synonym Dictionaries.” In: Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology (BioNLP 2006).
  • 7. Lynette Hirschman, Colosimo, M., Morgan, A., Colombe, J., Yeh, A.: Task 1B: Gene list task BioCreAtIvE workshop. In: BioCreative: Critical Assessment for Information Extraction in Biology. (2004)
  • 8. Hanisch, D., Fundel, K., Mevissen, H.T., Zimmer, R., Fluck, J.: ProMiner: Organism-specific protein name detection using approximate string matching. BMC Bioinformatics 6(Suppl 1):S14 (2005)
  • 9. Crim, J., McDonald, R., Fernando Pereira: Automatically annotating documents with normalized gene lists. BMC Bioinformatics 6(Suppl 1):S13 (2005)
  • 10. Fundel, K., Güttler, D., Zimmer, R., Apostolakis, J.: A simple approach for protein name identification: prospects and limits. BMC Bioinformatics 6(Suppl 1):S15 (2005)
  • 11. Tamames, J.: Text detective: A rule-based system for gene annotation. BMC Bioinformatics 6(Suppl 1):S10 (2005)
  • 12. Hachey, B., Nguyen, H., Nissim, M., Alex, B., Grover, C.: Grounding gene mentions with respect to gene database identifiers. In: BioCreAtIvE Workshop Handouts. (2004). Granada, Spain.
  • 13. Liu, H.: BioTagger: A biological entity tagging system. In: BioCreAtIvE Workshop Handouts. (2004). Granada, Spain.
  • 14. Morgan, A., Lynette Hirschman, Colosimo, M., Yeh, A., Colombe, J.: Gene name identification and normalization using a model organism database. J Biomedical Informatics 37 (2004) 396–410
  • 15. Hanisch, D., Fluck, J., Mevissen, H., Zimmer, R.: Playing biology’s name game: identifying protein names in scientific text. Pac Symp Biocomput 403-14 (2003)
  • 16. Rada Mihalcea, T. Chklovski, A. Kilgarriff. (2004). “The Senseval-3 English lexical sample task.” In: Proceedings of the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (Senseval-3).
  • 17. Schwartz, A., Hearst, M.: A simple algorithm for identifying abbreviation definitions in biomedical texts. In: Proceedings of the Pacific Symposium on Biocomputing. (2003)
  • 18. Ghanem, M., Guo, Y., Lodhi, H., Zhang, Y.: Automatic scientific text classification using local patterns: KDD Cup (2002). In: ACM SIGKDD Explorations Newsletter. Volume 4(2). (2003). 95–96

Author: Xinglong Wang
Title: Rule-based Protein Term Identification with Help from Automatic Species Tagging
Journal: Proceedings of CICLING
Title URL: http://www.ltg.ed.ac.uk/np/publications/ltg/papers/Wang2007Rulebased.pdf
DOI: 10.1007/978-3-540-70939-8_26
Year: 2007