2002 ProteinNamesAndHowToFindThem

Subject Headings: Named Entity Recognition Task


Cited By



  • A prerequisite for all higher level information extraction tasks is the identification of unknown names in text. Today, when large corpora can consist of billions of words, it is of utmost importance to develop accurate techniques for the automatic detection, extraction and categorization of named entities in these corpora. Although named entity recognition might be regarded as a solved problem in some domains, it still poses a significant challenge in others. In this work we focus on one of the more difficult tasks, the identification of protein names in text. This task presents several interesting difficulties because of the named entities' variant structural characteristics, their sometimes unclear status as names, the lack of common standards and fixed nomenclatures, and the specifics of the texts in the molecular biology domain in which they appear. We describe how we approached these and other difficulties in the implementation of Yapex, a system for the automatic identification of protein names in text. We also evaluate Yapex under four different notions of correctness and compare its performance to that of another publicly available system for protein name recognition.


  • applications.

In this paper we

    • discuss the role of automatic analysis of text in a specialized domain such as molecular biology (Sections 1.1–1.3)
    • discuss the nature of names in this domain and touch on the necessity of detecting named entities as a first step towards higher levels of analysis and refinement of information (Sections 1.4–1.6)
    • describe a system that uses a combination of heuristic pattern matching techniques and full syntactic analysis to find names of proteins in running text (Section 2)
    • discuss the general problems connected to the evaluation of such systems and propose an approach to evaluation of multi-word named entities (Sections 3.2 and 4)
    • evaluate the modules in our system and compare the system with another protein name tagger on a test corpus along our proposed notions of correctness (Section 3.3).
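
The idea of scoring multi-word names under more than one notion of correctness can be illustrated with a minimal sketch. The four notion names used here (`strict`, `left`, `right`, `overlap`) and the token-offset span representation are assumptions made for illustration only; they are not taken from the paper and may differ from the definitions the authors actually use.

```python
from typing import List, Tuple

# A span is (start, end) in token offsets, end exclusive (an assumed representation).
Span = Tuple[int, int]


def matches(pred: Span, gold: Span, notion: str) -> bool:
    """Return True if a predicted span counts as correct under the given notion."""
    if notion == "strict":   # both boundaries must agree
        return pred == gold
    if notion == "left":     # only the left boundary must agree
        return pred[0] == gold[0]
    if notion == "right":    # only the right boundary must agree
        return pred[1] == gold[1]
    if notion == "overlap":  # any shared token is enough
        return pred[0] < gold[1] and gold[0] < pred[1]
    raise ValueError(f"unknown notion: {notion}")


def precision_recall(pred: List[Span], gold: List[Span], notion: str) -> Tuple[float, float]:
    """Precision and recall of predicted spans against gold spans."""
    correct_pred = sum(any(matches(p, g, notion) for g in gold) for p in pred)
    found_gold = sum(any(matches(p, g, notion) for p in pred) for g in gold)
    precision = correct_pred / len(pred) if pred else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical data: the tagger misses the first token of a three-token name.
    gold = [(0, 3), (7, 8)]
    pred = [(1, 3), (7, 8), (10, 11)]
    for notion in ("strict", "left", "right", "overlap"):
        print(notion, precision_recall(pred, gold, notion))
```

In this made-up example the truncated name is penalized under the strict and left-boundary notions but accepted under the right-boundary and overlap notions, which shows why a single score can hide how a tagger fails on multi-word names.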




2002 ProteinNamesAndHowToFindThem
Author: Kristofer Franzen, Gunnar Eriksson, Fredrik Olsson
Title: Protein names and how to find them
URL: http://ice.sics.se/~franzen/Artiklar/Ijmi/ijmi02franzen01.pdf
Year: 2002