2002 RutabagaByAnyOtherName


Subject Headings: Term Recognition Task.


As the pace of biological research accelerates, biologists are becoming increasingly reliant on computers to manage the information explosion. Biologists communicate their research findings by relying on precise biological terms; these terms then provide indices into the literature and across the growing number of biological databases. This article examines emerging techniques to access biological resources through extraction of entity names and relations among them. Information extraction has been an active area of research in natural language processing and there are promising results for information extraction applied to news stories, e.g., balanced precision and recall in the 93–95% range for identifying person, organization and location names. But these results do not seem to transfer directly to biological names, where results remain in the 75–80% range. Multiple factors may be involved, including absence of shared training and test sets for rigorous measures of progress, lack of annotated training data specific to biological tasks, pervasive ambiguity of terms, frequent introduction of new terms, and a mismatch between evaluation tasks as defined for news and real biological problems. We present evidence from a simple lexical matching exercise that illustrates some specific problems encountered when identifying biological names. We conclude by outlining a research agenda to raise performance of named entity tagging to a level where it can be used to perform tasks of biological importance.

Table of Contents

1. Background
1.1. Why names are important
1.2. Extracting names
2. Extracting names in biology
2.1. Information extraction for news
2.2. Information extraction in biology
3. Are names in biology harder than names in news?
3.1. The experience factor
3.2. Training data
3.3. Interannotator agreement and task definition
3.4. A systematic comparison of biology and news
4. Naming biological entities
4.1. Biological name formation
4.2. A lexical-based pattern matching experiment
5. Lessons learned

3.3. Interannotator agreement and task definition

Interannotator agreement is far lower for the biological tasks than for MUC newswire (F-measure of 84-89% vs. 97% for news; see Table 1). This may be because biologists are being asked to perform a linguistic task that is, from their point of view, somewhat artificial. Biologists may not need to look at every occurrence of a term in an article. …
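To make the agreement figures concrete, the following is a minimal sketch of how an F-measure between two annotators might be computed. It assumes exact-match scoring over (start, end, type) spans and treats one annotator as the reference and the other as the response. The excerpt does not state the matching criterion behind the reported numbers, so the span representation, the function name `f_measure`, and the sample annotations are all illustrative assumptions, not the authors' procedure.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def f_measure(reference: Set[Span], response: Set[Span]) -> float:
    """Balanced F-measure (harmonic mean of precision and recall).

    Assumes exact span-and-type match; partial overlaps count as misses.
    """
    if not reference or not response:
        return 0.0
    matched = len(reference & response)
    precision = matched / len(response)   # share of response spans that match
    recall = matched / len(reference)     # share of reference spans recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented annotations for illustration only.
annotator_a = {(0, 4, "gene"), (10, 18, "protein"), (25, 30, "gene"), (40, 47, "gene")}
annotator_b = {(0, 4, "gene"), (10, 18, "protein"), (25, 31, "gene")}

agreement = f_measure(annotator_a, annotator_b)
print(f"Interannotator F-measure: {agreement:.3f}")
```

Under this definition the score is symmetric: swapping which annotator serves as the reference swaps precision and recall but leaves the F-measure unchanged. In the invented example, one boundary disagreement (offset 30 vs. 31) and one skipped mention are enough to lower agreement, which is the kind of effect that skipping some occurrences of a term would produce.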



2002 RutabagaByAnyOtherName
Author: Alexander A. Morgan, Lynette Hirschman, Alexander S. Yeh
Title: Rutabaga by Any Other Name: extracting biological names
Journal: Journal of Biomedical Informatics
Title URL: http://www.mitre.org/work/best papers/03/hirschman rutabaga/hirschman.pdf
DOI: 10.1016/S1532-0464(03)00014-5
Year: 2002