2005 UnsupGeneProtNENorml

From GM-RKB

Subject Headings: Entity Mention Normalization Algorithm, Dictionary-based Algorithm, Genia Corpus, BioCreative Task 1B, Protein NER.

Notes

Cited By

2008

Quotes

Abstract

Gene and protein named-entity recognition (NER) and normalization is often treated as a two-step process. While the first step, NER, has received considerable attention over the last few years, normalization has received much less attention. We have built a dictionary-based gene and protein NER and normalization system that requires no supervised training and no human intervention to build the dictionaries from online genomics resources. We have tested our system on the Genia corpus and the BioCreative Task 1B mouse and yeast corpora and achieved a level of performance comparable to state-of-the-art systems that require supervised learning and manual dictionary creation. Our technique should also work for organisms with naming conventions similar to mouse, such as human. Further evaluation and improvement of gene/protein NER and normalization systems is somewhat hampered by the lack of larger test collections and collections for additional organisms, such as human.

Introduction

In the genomics era, the field of biomedical research finds itself in the ironic situation of generating new information more rapidly than ever before, while at the same time individual researchers are having more difficulty getting the specific information they need. This hampers their productivity and efficiency. Text mining has been proposed as a means to assist researchers in handling the current expansion of the biomedical knowledge base (Hirschman et al., 2002). Fundamental tasks in text mining are named entity recognition (NER) and normalization. NER is the identification of text terms referring to items of interest, and normalization is the mapping of these terms to the unique concept to which they refer. Once the concepts of interest are identified, text mining can proceed to extract facts and other relationships of interest that involve these recognized entities. With the current research focus on genomics, identifying genes and proteins in biomedical text has become a fundamental problem in biomedical text mining research (Cohen and Hersh, 2005). The goal of our work here is to explore the potential of using curated genomics databases for dictionary-based NER and normalization. These databases contain a large number of gene and protein names, symbols, and synonyms, and would likely enable recognition of a wide range of genes in a wide range of literature without corpus-specific training.

Gene and protein NER and normalization can be viewed as a two-step process. The first step, NER, identifies the strings within a sample of text that refer to genes and proteins. The second step, normalization, determines the specific genes and proteins referred to by the text strings.

Many investigators have examined the initial step of gene and protein NER. One of the most successful rules-based approaches to gene and protein NER in biomedical texts has been the AbGene system (Tanabe and Wilbur, 2002), which has been used by several other researchers. After training on hand-tagged sentences from biomedical text, it applies a Brill-style tagger (Brill, 1992) and manually generated postprocessing rules. AbGene achieves a precision of 85.7% at a recall of 66.7% (F1 = 75%). Another successful system is GAPSCORE (Chang et al., 2004). It assigns a numeric score to each word in a sentence based on appearance, morphology, and context of the word and then applies a classifier trained on these features. After training on the Yapex corpus (Franzen et al., 2002), the system achieved a precision of 81.5% at a recall of 83.3% for partial matches.

For many applications of text mining, the second step, normalization, is as important as the first. Many biomedical concepts, including genes and proteins, have large numbers of synonymous terms (Yu and Agichtein, 2003; Tuason et al., 2004). Without normalization, different terms for the same concept are treated as distinct items, which can distort statistical and other analyses. Normalization can aggregate references to a given gene or protein and can therefore increase the sample size for concepts with common synonyms. However, normalization of gene and protein references has not received as much attention as the NER step.
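The aggregation effect described above can be illustrated with a toy example; the mentions and the synonym map below are hypothetical stand-ins for the paper's automatically built dictionaries:

```python
from collections import Counter

# Hypothetical gene mentions extracted from text; three of these
# surface forms refer to the same gene.
mentions = ["p53", "TP53", "tumor protein p53", "BRCA1", "p53"]

# A toy synonym map standing in for the normalization step, which
# maps each term to a single concept identifier (official symbol).
SYNONYMS = {"p53": "TP53", "tumor protein p53": "TP53",
            "TP53": "TP53", "BRCA1": "BRCA1"}

raw = Counter(mentions)                              # synonyms counted apart
normalized = Counter(SYNONYMS[m] for m in mentions)  # aggregated by concept
```

Without normalization the TP53 mentions are split 2/1/1 across three surface forms; after normalization they aggregate into a single count of 4, increasing the effective sample size for that concept.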

One recent conference, the BioCreative Critical Assessment for Information Extraction in Biology (Krallinger, 2004), had a challenge task that addressed gene and protein normalization. The task was to identify the specific genes mentioned in a set of abstracts, given that the organism of interest was mouse, fly, or yeast. Training and test collections of about 250 abstracts were manually prepared and made available to the participants along with synonym lists. Seven groups participated in this challenge task (Hirschman et al., 2004), with the best F-measures ranging from 92.1% on yeast to 79.1% on mouse. The overall best-performing system used a combination of hand-built dictionaries, approximate string matching, and parameter tuning based on the training data; it disambiguated matches using a collection of biomedical abbreviations combined with approximate string match scoring, preferring concepts matched by a high number of occurring terms (Hanisch et al., 2004).

One thing that almost all of these systems have in common is that they need to be trained on a text corpus and/or use manually built dictionaries based on the training corpus. Since the training corpus may be a small sample of the total relevant biomedical literature, it is uncertain how the performance of these systems will change over time or when applied to other sources of biomedical text. Also, since new genes and proteins are being described all the time, it is unclear how these systems will handle genes discovered after system training is complete. This may be an especially serious problem for normalization.

Dictionary-based approaches to gene and protein NER and normalization that require no training have several advantages over approaches based on orthographic, lexical, and contextual features. Currently there are few test collections for gene and protein normalization, and they are relatively small (Hirschman et al., 2004). Unsupervised systems therefore may perform more uniformly over different data sets, and over time for the near future. Since they are not dependent upon training to discover local orthographic or lexicographic clues, they can recognize long multi-word names as easily as short forms. Dictionary-based approaches can also normalize gene and protein names, reducing the many synonyms and phrases representing the same concept to a single identifier for that gene or protein.

In addition, dictionary-based approaches can make use of the huge amount of information in curated genomics databases. Currently, there is an enormous amount of manual curation activity related to gene and protein function. Several genomics databases contain large numbers of curated gene and protein symbols as well as full names. Groups such as the Human Genome Organisation (HUGO), Mouse Genome Informatics (MGI), UniProt, and the National Center for Biotechnology Information (NCBI) collect and organize information on genes and proteins, much of it from the biomedical literature, including gene names, symbols, and synonyms. Dictionary-based approaches provide a way to make use of this information for gene and protein NER and normalization. As the databases are updated by the curating organizations, a NER system based on these databases can automatically incorporate the new names and symbols. These approaches can also be very fast: much of the computation can be performed during the construction of the dictionary, leaving the actual search for dictionary terms a simple and rapid process.

Tsuruoka and Tsujii recently studied the use of dictionary-based approaches for protein name recognition (Tsuruoka and Tsujii, 2004), although they did not evaluate normalization performance. They applied a probabilistic term variant generator to expand the dictionary, and a Bayesian contextual filter with a sub-sentence window size to classify terms in the GENIA corpus as likely to represent protein names. Overall they obtained a precision of 71.1% at a recall of 62.3%, for an F-measure of 66.6%. Tsuruoka and Tsujii did not make use of curated database information; instead they split the GENIA corpus into training and test sets of 1800 and 200 abstracts respectively, and extracted the tagged protein names from the training set to use as a dictionary. These results compare well with, though fall slightly below, those of other non-dictionary-based methods applied to the GENIA corpus (Lee et al., 2004; Zhou et al., 2004).

In this work we attempt to answer several questions pertaining to dictionary-based gene/protein NER:

  • What curated databases provide the best collection of names and symbols?
  • Can simple rules generate sufficient orthographic variants?
  • Can common English word lists be used to decrease false positives?
  • What is the overall normalization performance of an unsupervised dictionary-based approach?
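The second question, on rule-generated orthographic variants, can be sketched with a few illustrative transforms. The specific rules below (hyphen/space interchange and Greek-letter substitution) are plausible examples of such domain-knowledge rules, not necessarily the paper's exact rule set:

```python
import re

# Spelled-out Greek letters and common single-letter abbreviations.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "kappa": "k"}

def variants(term: str) -> set:
    """Generate simple orthographic variants of a gene/protein term."""
    out = {term, term.lower(), term.upper()}
    # Hyphen/space interchange: "IL-2" <-> "IL 2" <-> "IL2".
    out.add(term.replace("-", " "))
    out.add(term.replace("-", ""))
    out.add(term.replace(" ", "-"))
    # Spelled-out Greek letters vs. single-letter abbreviations.
    for word, letter in GREEK.items():
        if word in term.lower():
            out.add(re.sub(word, letter, term, flags=re.IGNORECASE))
    return out
```

For example, `variants("IL-2")` includes "IL 2" and "IL2", and `variants("TNF-alpha")` includes "TNF-a". Each generated variant would simply be added to the dictionary under the same concept identifier.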

Methods

A dictionary-based NER system starts out with a list, potentially very large, of text strings, called terms, which represent concepts of interest. In our system, the terms are organized by concept, in this case a unique identifier for the gene or protein. All terms for a given concept are kept together. The combination of terms indexed by concept is similar to a traditional thesaurus, and when used for NER and normalization is usually called a dictionary. When a term is found in a sample of text, it is a simple process to map the term to the unique gene or protein that it represents. Several unique identifiers are in use by the gene curation organizations; we chose to use the official symbol as a default, but other database identifiers can easily be substituted as needed.
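A minimal sketch of this dictionary structure and lookup, using a hypothetical two-entry dictionary and naive longest-match-first matching (the paper's dictionaries are built automatically from curated databases, and its matching and disambiguation are more sophisticated):

```python
# Concept identifier (official symbol) -> known terms for that concept.
DICTIONARY = {
    "TP53": ["TP53", "p53", "tumor protein p53"],
    "BRCA1": ["BRCA1", "breast cancer 1"],
}

# Invert to term -> concept for fast matching; precomputing this index
# at dictionary-build time is what makes the search step rapid.
TERM_INDEX = {term.lower(): symbol
              for symbol, terms in DICTIONARY.items()
              for term in terms}

def normalize(text: str):
    """Return (matched term, concept id) pairs found in the text,
    scanning longest-match-first over a whitespace tokenization."""
    tokens = text.split()
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first (up to 4 tokens here).
        for length in range(min(4, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length]).lower().strip(".,;()")
            if candidate in TERM_INDEX:
                hits.append((candidate, TERM_INDEX[candidate]))
                i += length
                break
        else:
            i += 1
    return hits
```

Calling `normalize("Mutations in p53 and breast cancer 1 were studied")` returns both the matched strings and their official symbols, covering the NER and normalization steps in a single pass.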

Conclusions and Future Work

These results demonstrate that an unsupervised dictionary-based approach to gene and protein NER and normalization can be effective. The dictionaries can be created automatically without human intervention or review. Dictionary-based systems such as ours can be set up to update themselves automatically by downloading the database files from the Internet and preprocessing them into updated dictionaries. This could be done on a nightly basis if necessary, since the entire dictionary creation process takes only a few minutes. One general database, combined with an organism-specific database for each species, is sufficient.

Our work is distinguished from other dictionary-based work, such as that of Tsuruoka and Tsujii and of Hanisch et al., in several ways. Unlike both of these prior investigators, we use online curated information as our primary source of terms, instead of deriving them from a training set, and have shown both which databases to use and how to process them into effective sources of terms for NER. Our textual variants are generated by simple rules determined by domain knowledge instead of machine learning on training data. Lastly, the disambiguation algorithm presented here is unique and has been shown to have a positive impact on performance.

The system is as accurate as other more complex approaches. It does not require training, and so may be less sensitive to specific characteristics of a given text corpus. It may also be applied to organisms for which there do not exist sufficient training and test collections. In addition, the system is very fast. This may enable some text mining tasks to be done for users in real time, rather than the batch processing mode that is currently most common in biomedical text mining research.

Dictionary-based approaches are likely to remain an essential part of gene and protein normalization, even if the NER step is handled by other methods. Further work is necessary to determine the best manner to combine automatically created dictionaries with trained NER systems. It may be the case that different approaches work best for different organisms, depending upon the specific naming conventions of scientists working on that species.

References

  • Brill, E. (1992). A simple rule-based part of speech tagger. In: Proceedings of the Third Conference on Applied Natural Language Processing.
  • Carroll, J. B., Davies, P. and Richman, B. (1971). The American Heritage Word Frequency Book. Houghton Mifflin, Boston.
  • Chang, J. T., Schütze, H. and Altman, R. B. (2004). GAPSCORE: finding gene and protein names one word at a time. Bioinformatics, 20, 216-25.
  • Chen, L., Liu, H. and Friedman, C. (2005). Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics, 21, 248-56.
  • Cherry, J. M. (1995) Genetic nomenclature guide. Saccharomyces cerevisiae. Trends Genet, 11-2.
  • Cohen, A. M. and Hersh, W. (2005). A Survey of Current Work in Biomedical Text Mining. Briefings in Bioinformatics, 6, 57-71.
  • Cohen, A. M., Hersh, W. R., Dubay, C. and Spackman, K. (2005). Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts. BMC Bioinformatics, 6.
  • Collier, N. and Takeuchi, K. (2004). Comparison of character-level and part of speech features for name recognition in biomedical texts. J Biomed Inform, 37, 423-35.
  • Franzen, K., Eriksson, G., Olsson, F., Asker, L., Liden, P. and Coster, J. (2002). Protein names and how to find them. Int J Med Inf, 67, 49-61.
  • Hanisch, D., Fundel, K., Mevissen, H. T., Zimmer, R. and Fluck, J. (2004). ProMiner: Organism-specific protein name detection using approximate string matching. In BioCreative: Critical Assessment for Information Extraction in Biology.
  • Hirschman, L., Morgan, A. A. and Yeh, A. S. (2002). Rutabaga by any other name: extracting biological names. J Biomed Inform, 35, 247-59.
  • Hirschman, L., Colosimo, M., Morgan, A., Columbe, J. and Yeh, A. (2004). Task 1B: Gene List Task. In BioCreative: Critical Assessment for Information Extraction in Biology.
  • Hu, Z. Z., Mani, I., Hermoso, V., Liu, H. and Wu, C. H. (2004). iProLINK: an integrated protein resource for literature mining. Comput Biol Chem, 28, 409-16.
  • Kim, J. D., Ohta, T., Tateisi, Y. and Tsujii, J. (2003). GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19, i180-i182.
  • Krallinger, M. (2004). BioCreAtIvE - Critical Assessment of Information Extraction systems in Biology. http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html
  • Lee, K. J., Hwang, Y. S., Kim, S. and Rim, H. C. (2004). Biomedical named entity recognition using two-phase model based on SVMs. J Biomed Inform, 37, 436-47.
  • Tanabe, L. and Wilbur, W. J. (2002). Tagging gene and protein names in biomedical text. Bioinformatics, 18, 1124-32.
  • Tsuruoka, Y. and Tsujii, J. (2004). Improving the performance of dictionary-based approaches in protein name recognition. J Biomed Inform, 37, 461-70.
  • Tuason, O., Chen, L., Liu, H., Blake, J. A. and Friedman, C. (2004). Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput, 238-49.
  • Ward, G. (2000) Grady Ward's Moby. http://www.dcs.shef.ac.uk/research/ilash/Moby/mwords.html
  • Yu, H. and Agichtein, E. (2003). Extracting synonymous gene and protein terms from biological literature. Bioinformatics, 19, i340-i349.
  • Zhou, G., Zhang, J., Su, J., Shen, D. and Tan, C. (2004). Recognizing names in biomedical texts: a machine learning approach. Bioinformatics, 20, 1178-90.

Aaron M. Cohen (2005). "Unsupervised Gene/Protein Named Entity Normalization Using Automatically Extracted Dictionaries." In: Proceedings of the ACL-ISMB Workshop on Linking Biological Literature. http://acl.ldc.upenn.edu/W/W05/W05-1303.pdf