2004 TermIdentificationInTheBiomedLit

From GM-RKB

Subject Headings: Term Mention Recognition Task, Term Mention Classification Task, Term Mention Mapping Task.

Notes

Cited By

2006

2005

Quotes

Author Keywords

Term identification; Term recognition; Term classification; Term mapping; Acronym recognition; Biomedical literature

Abstract

Sophisticated information technologies are needed for effective data acquisition and integration from a growing body of the biomedical literature. Successful term identification is key to getting access to the stored literature information, as it is the terms (and their relationships) that convey knowledge across scientific articles. Due to the complexities of a dynamically changing biomedical terminology, term identification has been recognized as the current bottleneck in text mining, and — as a consequence — has become an important research topic both in natural language processing and biomedical communities. This article overviews state-of-the-art approaches in term identification. The process of identifying terms is analysed through three steps: term recognition, term classification, and term mapping. For each step, main approaches and general trends, along with the major problems, are discussed. By assessing previous work in context of the overall term identification process, the review also tries to delineate needs for future work in the field.
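As a rough illustration of the three steps named in the abstract (term recognition, term classification, term mapping), the sketch below chains them over a toy lexicon. It is only a minimal dictionary-lookup example and is not the method of any system surveyed in the article. The lexicon entries, the `CONCEPT:` identifiers, and all function names are hypothetical.

```python
import re

# Hypothetical toy lexicon: lowercase surface form -> (semantic class, concept id).
# Entries and identifiers are invented for illustration, not taken from a real database.
LEXICON = {
    "nf-kappa b": ("protein", "CONCEPT:0001"),
    "nf-kb": ("protein", "CONCEPT:0001"),
    "interleukin 2": ("protein", "CONCEPT:0002"),
    "il-2": ("protein", "CONCEPT:0002"),
    "t cell": ("cell_type", "CONCEPT:0003"),
}


def recognize(text):
    """Term recognition: return (start, end) spans of lexicon entries found in text.

    Longer surface forms are tried first, and any match that overlaps an
    already accepted span is skipped.
    """
    lowered = text.lower()
    spans = []
    for form in sorted(LEXICON, key=len, reverse=True):
        # Require that the match is not embedded inside a longer word.
        pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
        for match in re.finditer(pattern, lowered):
            start, end = match.start(), match.end()
            if all(end <= s or start >= e for s, e in spans):
                spans.append((start, end))
    return sorted(spans)


def classify(surface):
    """Term classification: look up the semantic class of a recognized term."""
    return LEXICON[surface.lower()][0]


def map_term(surface):
    """Term mapping: link a recognized term (or variant) to a concept identifier."""
    return LEXICON[surface.lower()][1]


def identify_terms(text):
    """Run recognition, classification, and mapping in sequence."""
    results = []
    for start, end in recognize(text):
        surface = text[start:end]
        results.append((surface, classify(surface), map_term(surface)))
    return results


if __name__ == "__main__":
    for row in identify_terms("IL-2 activates NF-kB in T cell lines."):
        print(row)
```

In this sketch the variants "NF-kB" and "NF-kappa B" map to the same identifier, which is the simplest possible form of handling term variability. Ambiguity (one surface form with several candidate concepts) is not handled at all.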

Article Outline

1. Introduction
2. Term identification task
2.1. Term recognition
2.1.1. Dictionary-based approaches
2.1.2. Rule-based approaches
2.1.3. Machine-learning and statistical approaches
2.1.4. Hybrid approaches
2.1.5. Acronym recognition
2.2. Term classification
2.3. Term mapping
2.3.1. Handling term variability
2.3.2. Handling term ambiguity
3. Conclusions and challenges
Acknowledgements
References


References

  • 1. {1} Gaizauskas R, Demetriou G, Humphreys K. Term recognition and classification in biological science journal articles. In: Proceedings of Workshop on Computational Terminology for Medical and Biological Applications. Patras, Greece; 2000. pp. 37-44.
  • 2. Lynette Hirschman, Alexander A. Morgan, Alexander S. Yeh, Rutabaga by any other name: extracting biological names, Journal of Biomedical Informatics, v.35 n.4, p.247-259, August 2002 doi:10.1016/S1532-0464(03)00014-5
  • 3. {3} Tuason O, Chen L, Liu H, Blake JA, Friedman C. Biological nomenclature: a source of lexical knowledge and ambiguity. In: Proceedings of Pacific Symposium on Biocomputing; 2004. pp. 238-49.
  • 4. {4} The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 2003;31(1):172-5.
  • 5. {5} Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003;31(1):365-70.
  • 6. {6} Collier N, Nobata C, Tsujii J. Automatic term identification and classification in biological texts. In: Proceedings of Natural Language Pacific Rim Symposium. Beijing, China; 1999. pp. 369-74.
  • 7. {7} Pruitt KD, Maglott DR. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001;29(1):137-40.
  • 8. {8} Ohta T, Tateisi Y, Mima H, Tsujii J. GENIA corpus: an annotated research abstract corpus in molecular biology domain. In: Proceedings of Human Language Technology Conference (HLT 2002). (2002). pp. 73-7.
  • 9. {9} Nenadic G, Spasic I, Ananiadou S. Mining biomedical abstracts: What is in a term? In: Proceedings of International Joint Conference on NLP. Sanya, China; 2004. pp. 247-54.
  • 10. {10} Krauthammer M, Rzhetsky A, Morozov P, Friedman C. Using BLAST for identifying gene and protein names in journal articles. Gene 2000;259(1-2):245-52.
  • 11. {11} Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. Genbank. Nucleic Acids Res 2000;28(1):15-8.
  • 12. {12} Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990;215(3):403-10.
  • 13. {13} Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25(17):3389-402.
  • 14. Yoshimasa Tsuruoka, Jun'ichi Tsujii, Probabilistic term variant generator for biomedical terms, Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, July 28-August 01, 2003, Toronto, Canada doi:10.1145/860435.860467
  • 15. Yoshimasa Tsuruoka, Jun'ichi Tsujii, Boosting precision and recall of dictionary-based protein name recognition, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.41-48, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118964
  • 16. {16} Bourigault D, Gonzalez-Mullier I, Gros C. LEXTER, a Natural language processing tool for terminology extraction. In: Proceedings of EURALEX '96. (1996). pp. 771-9.
  • 17. Catherine Blake, Wanda Pratt, Better Rules, Fewer Features: A Semantic Approach to Selecting Features from Text, Proceedings of the 2001 IEEE International Conference on Data Mining, p.59-66, November 29-December 02, 2001
  • 18. Sophia Ananiadou, A methodology for automatic term recognition, Proceedings of the 15th conference on Computational linguistics, August 05-09, 1994, Kyoto, Japan doi:10.3115/991250.991317
  • 19. {19} Humphreys K, Demetriou G, Gaizauskas R. Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. In: Proceedings of Pacific Symposium on Biocomputing. (2000). pp. 505-16.
  • 20. {20} Gaizauskas R, Demetriou G, Artymiuk PJ, Willett P. Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 2003;19(1):135-43.
  • 21. {21} Fukuda K, Tamura A, Tsunoda T, Takagi T. Toward information extraction: identifying protein names from biological papers. In: Proceedings of Pacific Symposium on Biocomputing. (1998). pp. 707-18.
  • 22. {22} Narayanaswamy M, Ravikumar KE, Vijay-Shanker K. A biological named entity recognizer. In: Proceedings of Pacific Symposium on Biocomputing. (2003). pp. 427-38.
  • 23. {23} Franzen K, Eriksson G, Olsson F, Asker L, Liden P, Coster J. Protein names and how to find them. Int J Med Inf 2002;67(1- ):49-61.
  • 24. Wen-Juan Hou, Hsin-Hsi Chen, Enhancing performance of protein name recognizers using collocation, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.25-32, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118962
  • 25. Jerry R. Hobbs, Information extraction from biomedical text, Journal of Biomedical Informatics, v.35 n.4, p.260-264, August 2002 doi:10.1016/S1532-0464(03)00015-7
  • 26. {26} Thomas J, Milward D, Ouzounis C, Pulman S, Carroll M. Automatic extraction of protein interactions from scientific abstracts. In: Proceedings of Pacific Symposium on Biocomputing. (2000). pp. 541-52.
  • 27. {27} Hobbs JR, Appelt D, Bear J, Israel D, Kameyama M, Stickel M, et al. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. In: Finite-State Language Processing. Cambridge: MIT Press; 1997. pp. 383-406.
  • 28. {28} Andrade MA, Valencia A. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 1998;14(7):600-7.
  • 29. {29} Hatzivassiloglou V, Duboue PA, Rzhetsky A. Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics 2001;17(Suppl. 1):S97-106.
  • 30. Mark Craven, Johan Kumlien, Constructing Biological Knowledge Bases by Extracting Information from Text Sources, Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, p.77-86, August 06-10, 1999
  • 31. {31} Hodges PE, Payne WE, Garrels JI. The Yeast Protein Database (YPD): a curated proteome database for Saccharomyces cerevisiae. Nucleic Acids Res 1998;26(1):68-72.
  • 32. Alex Morgan, Lynette Hirschman, Alexander Yeh, Marc Colosimo, Gene name extraction using FlyBase resources, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.1-8, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118959
  • 33. Nigel Collier, Chikashi Nobata, Jun-ichi Tsujii, Extracting the names of genes and gene products with a hidden Markov model, Proceedings of the 18th conference on Computational linguistics, p.201-207, July 31-August 04, 2000, Saarbrücken, Germany doi:10.3115/990820.990850
  • 34. Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, Chew-Lim Tan, Effective adaptation of a Hidden Markov Model-based named entity recognizer for biomedical domain, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.49-56, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118965
  • 35. Jun'ichi Kazama, Takaki Makino, Yoshihiro Ohta, Jun'ichi Tsujii, Tuning support vector machines for biomedical named entity recognition, Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain, p.1-8, July 11-11, 2002, Philadelphia, Pennsylvania doi:10.3115/1118149.1118150
  • 36. Koichi Takeuchi, Nigel Collier, Bio-medical entity extraction using Support Vector Machines, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.57-64, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118966
  • 37. Kaoru Yamamoto, Taku Kudo, Akihiko Konagaya, Yuji Matsumoto, Protein name tagging for biomedical annotation in text, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.65-72, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118967
  • 38. Ki-Joong Lee, Young-Sook Hwang, Hae-Chang Rim, Two-phase biomedical NE recognition based on SVMs, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.33-40, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118963
  • 39. {39} Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics 2002;18(8):1124-32.
  • 40. Eric D. Brill, A simple rule-based part of speech tagger, Proceedings of the third Conference on Applied Natural Language Processing, March 31-April 03, 1992, Trento, Italy doi:10.3115/974499.974526
  • 41. {41} Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004;32(1):D258-61.
  • 42. {42} Proux D, Rechenmann F, Julliard L, Pillet VV, Jacq B. Detecting gene symbols and names in biological texts: a first step toward pertinent information extraction. In: Proceedings of Ninth Workshop on Genome Informatics. (1998). pp. 72-80.
  • 43. {43} Rindflesch TC, Hunter L, Aronson AR. Mining molecular binding terminology from biomedical text. In: Proceedings of AMIA Symposium. (1999). pp. 127-31.
  • 44. {44} Humphreys BL, Lindberg DA, Schoolman HM, Barnett GO. The unified medical language system: an informatics research collaboration. J Am Med Inform Assoc 1998;5(1):1-11.
  • 45. {45} Rindflesch TC, Tanabe L, Weinstein JN, Hunter L. EDGAR: extraction of drugs, genes and relations from the biomedical literature. In: Proceedings of Pacific Symposium on Biocomputing. (2000). pp. 517-28.
  • 46. {46} (Frantzi et al., 2000) ⇒ Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. (2000). “Automatic Recognition of Multi-Word Terms: The Cvalue/NC-value method.” In: International Journal on Digital Libraries, 3(2). doi:10.1007/s007999900023
  • 47. {47} Ananiadou S, Albert S, Schuhmann D. Evaluation of automatic term recognition of nuclear receptors from medline. Genome Informatics Series 2000.
  • 48. Goran Nenadic, Simon Rice, Irena Spasic, Sophia Ananiadou, Benjamin Stapley, Selecting text features for gene name classification: from documents to terms, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.121-128, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118974
  • 49. {49} Nenadic G, Spasic I, Ananiadou S. Automatic Acronym acquisition and term variation management within domain-Specific texts. In: Proceedings of LREC-3. Las Palmas, Spain; 2002. pp. 2155-62.
  • 50. {50} Nenadic G, Spasic I, Ananiadou S. Terminology-driven mining of biomedical literature. Bioinformatics 2003;19(8):938-43.
  • 51. {51} Adar E. S-RAD: A Simple and Robust Abbreviation Dictionary. USA: HP Lab; 2002.
  • 52. {52} Chang JT, Schutze H, Altman RB. Creating an online dictionary of abbreviations from medline. J Am Med Inform Assoc 2002;9(6):612-40.
  • 53. {53} Rimer M, O'Connell M. BioABACUS: a database of abbreviations and acronyms in biotechnology and computer science. Bioinformatics 1998;14(10):888-9.
  • 54. Hong Yu, Vasileios Hatzivassiloglou, Andrey Rzhetsky, W. John Wilbur, Automatically identifying gene/protein terms in MEDLINE abstracts, Journal of Biomedical Informatics, v.35 n.5/6, p.322-330, October 2002 doi:10.1016/S1532-0464(03)00032-7
  • 55. {55} Yoshida M, Fukuda K, Takagi T. PNAD-CSS: A Workbench for Constructing a Protein name abbreviation dictionary. Bioinformatics 2000;16(2):169-75.
  • 56. {56} Liu H, Aronson AR, Friedman C. A study of abbreviations in MEDLINE abstracts. In: Proceedings of AMIA Symposium. (2002). pp. 464-8.
  • 57. {57} Yu H, Hripcsak G, Friedman C. Mapping abbreviations to full forms in biomedical articles. J Am Med Inform Assoc 2002;9(3):262-72.
  • 58. {58} Schwartz AS, Hearst MA. A simple algorithm for identifying abbreviation definitions in biomedical text. In: Proceedings of Pacific Symposium on Biocomputing. (2003). pp. 451-62.
  • 59. {59} Pustejovsky J, Castano J, Cochran B, Kotecki M, Morrell M, Rumshisky A. Extraction and Disambiguation of Acronym-Meaning Pairs in Medline. In: Proceedings of Medinformatics. 2001.
  • 60. Manabu Torii, Sachin Kamboj, K. Vijay-Shanker, An investigation of various information sources for classifying biological names, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.113-120, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118973
  • 61. {61} Nobata C, Collier N, Tsujii J. Automatic term identification and classification in biological texts. In: Proceedings of Natural Language Pacific Rim Symposium. (1999). pp. 369-74.
  • 62. {62} Torii M, Vijay-Shanker K. Using unlabeled MEDLINE abstracts for biological named entity classification. In: Proceedings of Genome Informatics Workshop 2002. (2002). pp. 567-658.
  • 63. Irena Spasic, Goran Nenadic, Sophia Ananiadou, Using domain-specific verbs for term classification, Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine, p.17-24, July 11-11, 2003, Sapporo, Japan doi:10.3115/1118958.1118961
  • 64. {64} Raychaudhuri S, Chang JT, Sutphin PD, Altman RB. Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Res 2002;12(1): 203-14.
  • 65. {65} Seewald A. Towards recognizing domain and species from MEDLINE publications. In: Proceedings of European Workshops on Data Mining and Text Mining for Bioinformatics. (2003). pp. 51-8.
  • 66. K. Bretonnel Cohen, George K. Acquaah-Mensah, Andrew E. Dolbey, Lawrence Hunter, Contrast and variability in gene names, Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain, p.14-20, July 11-11, 2002, Philadelphia, Pennsylvania doi:10.3115/1118149.1118152
  • 67. {67} Jacquemin C. Spotting and Discovering Terms through NLP. Cambridge, MA: MIT Press; 2001.
  • 68. {68} Jacquemin C, Tzoukermann E. NLP for Term Variant Extraction: A Synergy of Morphology, Lexicon and Syntax. In: Strzalkowski T, editor, Natural Language Information Retrieval. Boston, MA: Kluwer; 1999. pp. 25-74.
  • 69. {69} Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of AMIA Symposium. (2001). pp. 17-21.
  • 70. {70} Yu H, Agichtein E. Extracting synonymous gene and protein terms from biological literature. Bioinformatics 2003;19(Suppl. 1): I340-9.
  • 71. {71} Liu H, Johnson SB, Friedman C. Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. J Am Med Inform Assoc 2002;9(6):621-36.
  • 72. Serguei Pakhomov, Semi-supervised Maximum Entropy based approach to acronym and abbreviation normalization in medical texts, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 07-12, 2002, Philadelphia, Pennsylvania doi:10.3115/1073083.1073111
  • 73. {73} Blaschke C, Valencia A. Molecular biology nomenclature thwarts information-extraction progress. IEEE Intell Syst 2002;17(3): 73-6.
  • 74. {74} Ogren P, Cohen K, Acquaah-Mensah G, Eberlein J, Hunter L. The compositional structure of gene ontology terms. In: Proceedings of Pacific Symposium on Biocomputing 2004. pp. 214-25.
  • 75. {75} Hisamitsu T, Tsujii J. Measuring term representativeness. In: Pazienza MT, editor. Information Extraction in the Web Era, LNAI 2700. New York, NY: Springer; 2003. pp. 45-76.


Page: 2004 TermIdentificationInTheBiomedLit
Author: Michael Krauthammer, Goran Nenadic
title: Term Identification in the Biomedical Literature
journal: Journal of Biomedical Informatics
titleUrl: http://personalpages.manchester.ac.uk/staff/G.Nenadic/papers\Krauthammer Nenadic pre print.pdf
doi: 10.1016/j.jbi.2004.08.004
year: 2004