2008 LearningSpeciesOfBiomedNEs

From GM-RKB

Subject Headings: Organism Component Semantic Relation Recognition Task, ITI TXM Corpora, Organism Mention Normalization Task, Organism NER.

Notes

Ideas

  • I was surprised to see that the Binary Relation between an Organism and its constituents was still an open problem!
  • Our advantages include that:
    • we treat the problem as a general relation extraction one.
    • our features are neither custom nor manually constructed.

Cited By

Quotes

Abstract

  • In biomedical articles, terms with the same surface forms are often used to refer to different entities across a number of model organisms, in which case determining the species becomes crucial to term identification systems that ground terms to specific database identifiers. This paper describes a rule-based system that extracts “species indicating words”, such as human or murine, which can be used to decide the species of the nearby entity terms, and a machine-learning species disambiguation system that was developed on manually species-annotated corpora. The performance of both systems was evaluated on gold-standard datasets, where the machine-learning system yielded better overall results.

Introduction

  • Information Extraction (IE) technologies such as Named Entity Recognition (NER), Term Identification (TI) and Relation Extraction (RE) have been shown to help reduce the laborious work involved in curating the vast amount of biomedical research papers (Karamanis et al., 2007; Alex et al., 2008; Wang and Matthews, 2008). In a typical curation process for protein-protein interactions (PPI), an IE system would first recognise the protein mentions (i.e., NER) and then assign unique database identifiers to them (i.e., TI), and finally input the pairs of identifiers of the interacting proteins into a database of PPIs. As an intermediate module that disambiguates the mentions and normalises them to database identifiers, TI is essential because strings of text with the same surface form can often be used to refer to different entities. As noted in our previous work and elsewhere, determining the correct species for the protein mentions is one of the most important steps towards TI (Krauthammer and Nenadic, 2004; Chen et al., 2005; Krallinger et al., 2007; Wang, 2007).
  • We found that Plk1 can phosphorylate Nek2 in vitro and interacts with Nek2 in vivo.
  • For example, searching RefSeq for the string plk1 from the above sentence resulted in 98 hits, whereas when a species (e.g., mouse (Mus musculus)) was added to the query, we were able to narrow down the number of choices to two. This paper reports on our efforts in building species-annotated corpora, in which species tags were manually assigned to occurrences of several types of biomedical entities, including proteins, genes and mRNAs, and in developing an automatic species tagger using rule-based and machine-learning approaches based on this resource.
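The narrowing effect described in the RefSeq example can be sketched as a species filter over lexicon candidates. This is only an illustration: the `Entry` type, the `candidates` function, and the tiny lexicon with its made-up identifiers are assumptions introduced here, not RefSeq data or the authors' code.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    identifier: str  # hypothetical database identifier
    name: str        # surface form stored in the lexicon
    species: str     # species the entry belongs to


# Hypothetical mini-lexicon; the identifiers are invented for illustration.
LEXICON = [
    Entry("ID:0001", "Plk1", "Homo sapiens"),
    Entry("ID:0002", "Plk1", "Mus musculus"),
    Entry("ID:0003", "PLK1", "Mus musculus"),
    Entry("ID:0004", "plk1", "Danio rerio"),
    Entry("ID:0005", "Nek2", "Mus musculus"),
]


def candidates(mention: str, species: Optional[str] = None,
               lexicon: List[Entry] = LEXICON) -> List[Entry]:
    """Return entries whose name matches the mention, ignoring case.

    When a species is given, only entries of that species are kept.
    """
    key = mention.lower()
    return [
        entry for entry in lexicon
        if entry.name.lower() == key
        and (species is None or entry.species == species)
    ]


all_hits = candidates("plk1")
mouse_hits = candidates("plk1", species="Mus musculus")
print(len(all_hits), len(mouse_hits))
```

With the species known, a term identification step has far fewer identifiers left to choose between, which is the motivation for species tagging given in the text.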

Related Work

  • The BioCreAtIvE I & II evaluation workshops (Hirschman et al., 2005; Hirschman et al., 2007) provided forums and gold-standard datasets for the community on evaluating biomedical IE systems such as NER (Yeh et al., 2005; Wilbur et al., 2007), TI (Hirschman et al., 2004; Morgan and Hirschman, 2007), and RE (Blaschke et al., 2005; Krallinger et al., 2007). A number of tasks in the recent BioCreAtIvE II workshop have addressed the importance of species disambiguation. For example, the protein interaction pairs subtask (IPS) (Krallinger et al., 2007) resembled the workflow of manual curation of PPIs and required identification of interacting proteins across many model organisms. (The curation task that we refer to here requires curators to identify examples of protein-protein interactions in biomedical literature, which is a laborious task requiring considerable expertise.) The best result for this task was fairly low at 28.85% F1, and a number of participants have reported (e.g., Grover et al., 2007) that species ambiguity posed one of the biggest challenges. An analysis of the training dataset of IPS revealed that the interacting proteins in this corpus belong to over 60 species, and only 56.27% of them are human.

  • Also, Chen et al. (2005) collected gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. Their study showed that the intra-species ambiguity in gene names was negligible at 0.02%, whereas cross-species ambiguity was high at 14.2%. This suggests that resolving species ambiguity would be an effective step towards gene name identification. On the other hand, as Ananiadou et al. (2004) suggested, existing text processing resources typically lack information that can support disambiguation of terms, and such resources do not address ambiguities related to finer biological classification, such as species information.
  • Our previous work (Wang, 2007) reported initial results of a species disambiguation system and the performance of TI with the system integrated. The accuracy of species tagging was 56.0% as tested by 10-fold cross validation on the training data and was 75.0% on the development test data. This species tagging component also improved the performance of a rule-based TI system by 10%. Note that those experiments were conducted on a different dataset using a different species ontology from the ones reported in this paper, and therefore the results are not comparable to those presented in this paper.

Rule-based Approach

  • It is intuitive that a species word that occurs right before an entity mention (e.g., mouse p53) should be a strong indicator of its species. To assess how well this intuition works, we developed a rule-based system using the heuristic and species words detected by the species word tagger. We devised four rule-based systems.
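A minimal sketch of the "species word right before the entity" heuristic follows. It assumes pre-tokenised input and a small hand-written species word list. The names `SPECIES_WORDS` and `tag_by_previous_word` are hypothetical, and the four rule variants mentioned above are not reproduced here because this page does not describe them.

```python
from typing import List, Optional

# Hypothetical species word list; a real species word tagger would cover far more.
SPECIES_WORDS = {
    "human": "Homo sapiens",
    "mouse": "Mus musculus",
    "murine": "Mus musculus",
    "yeast": "Saccharomyces cerevisiae",
}


def tag_by_previous_word(tokens: List[str], entity_index: int) -> Optional[str]:
    """Return the species indicated by the token directly before the entity.

    Returns None when the entity is the first token or when the preceding
    token is not a known species word.
    """
    if entity_index <= 0:
        return None
    return SPECIES_WORDS.get(tokens[entity_index - 1].lower())


# "mouse p53": the species word directly precedes the entity mention.
print(tag_by_previous_word(["mouse", "p53", "was", "expressed"], 1))

# Plk1 in the example sentence has no preceding species word, so no tag is assigned.
print(tag_by_previous_word("We found that Plk1 can phosphorylate Nek2".split(), 3))
```

The second call shows why such a rule can be precise but cover few mentions: most entity mentions are not immediately preceded by a species word.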

Machine Learning Approach

  • We also conducted research on machine-learning approaches to species tagging. First, we paired up a vector of contextual features with every entity mention in the training splits of the EPPI and TE data. Then, a number of Maximum Entropy models were trained on such instances.
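The pairing of contextual features with entity mentions, followed by Maximum Entropy training, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the window-of-words features, the plain stochastic gradient training loop, the toy sentences, and the function names. The paper's actual feature set, toolkit and parameter settings are not given on this page.

```python
import math
from collections import defaultdict


def context_features(tokens, index, window=2):
    """Position-tagged, lower-cased words within `window` tokens of the mention."""
    feats = []
    for offset in range(-window, window + 1):
        j = index + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats.append(f"w[{offset:+d}]={tokens[j].lower()}")
    return feats


def predict_proba(weights, labels, feats):
    """Softmax over summed feature weights (the maximum entropy model form)."""
    scores = [sum(weights[(f, y)] for f in feats) for y in labels]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_maxent(instances, epochs=50, rate=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    labels = sorted({gold for _, gold in instances})
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, gold in instances:
            probs = predict_proba(weights, labels, feats)
            for label, prob in zip(labels, probs):
                gradient = (1.0 if label == gold else 0.0) - prob
                for f in feats:
                    weights[(f, label)] += rate * gradient
    return weights, labels


def classify(weights, labels, feats):
    probs = predict_proba(weights, labels, feats)
    return labels[probs.index(max(probs))]


# Toy training data (invented): tokens, index of the entity mention, species tag.
TRAIN = [
    ("the murine p53 gene".split(), 2, "Mus musculus"),
    ("in mouse cells Plk1 binds".split(), 3, "Mus musculus"),
    ("human BRCA1 is mutated".split(), 1, "Homo sapiens"),
    ("patients with mutated TP53 alleles".split(), 3, "Homo sapiens"),
]
instances = [(context_features(t, i), species) for t, i, species in TRAIN]
weights, labels = train_maxent(instances)

print(classify(weights, labels, context_features("murine Nek2 expression".split(), 1)))
```

Unlike the previous-word rule, a model of this kind can draw on any context word it has seen in training, which is also why the species distribution of the training data matters so much.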

Conclusions and Future Work

  • We adopted the species annotated corpora developed in the TXM project and investigated various techniques for assigning species tags to biomedical entities. We found that the common heuristic of tagging an entity with the species indicated by its previous species word was not reliable: as tested on our EPPI and TE datasets, this heuristic achieved good precision of 81.88% and 91.49%, but very low recall.
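To see why good precision combined with very low recall still gives a weak overall score, the standard definitions can be worked through with made-up counts. The counts below are hypothetical and are not the paper's figures; only the formulas are standard.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: the rule fires on 100 of 900 mentions and is right 90 times.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=810)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean of the two, it stays close to the smaller value, so a rule that rarely fires cannot score well however accurate it is when it does.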

  • Subsequently, we experimented with a machine-learning based approach, using a large set of features and different parameter settings. Our best results were much higher than those of the rule-based system, with F1 scores over 71%. The problem with the machine-learning based approach, however, is that the distribution of species in the training data has a big impact on the model trained on it. In other words, a species tagger trained on a corpus dominated by human would have little chance of achieving good results on a test corpus full of zebrafish. Increasing the size and coverage of the training data is an obvious solution, and our experiment showed that a model trained on a combined set of data from both the EPPI and the TE domains achieved very good results on the devtest dataset from either domain. In the future we would like to explore how we can seek help from specific rules in situations when machine-learning models would not work. For example, in articles that are not talking about the common species such as human, fly and mouse, specific rules making use of species words might work better for detecting the species of the biomedical entities.

  • In addition, we would like to integrate the species tagger into term identification and relation extraction systems, making them capable of dealing with biomedical entities across multiple species.

References

  • B. Alex, C. Grover, B. Haddow, M. Kabadjov, E. Klein, M. Matthews, S. Roebuck, R. Tobin, and X. Wang. (2008). Assisted curation: does text mining really help? In: Proceedings of the Pacific Symposium on Biocomputing (PSB).
  • Sophia Ananiadou, C. Friedman, and Jun'ichi Tsujii. (2004). Introduction: Named entity recognition in biomedicine. Journal of Biomedical Informatics, 37(6):393–395.
  • C. Blaschke, E. A. Leon, M. Krallinger, and A. Valencia. (2005). Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, 6(Suppl 1):S16.
  • L. Chen, H. Liu, and C. Friedman. (2005). Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics, 21(2):248–256.
  • C. Grover, B. Haddow, E. Klein, M. Matthews, L. A. Nielsen, R. Tobin, and X. Wang. (2007). Adapting a relation extraction pipeline for the BioCreAtIvE II task. In: Proceedings of the BioCreAtIvE II Workshop 2007, Madrid.
  • L. Hirschman, Marc E. Colosimo, A. Morgan, J. Columbe, and A. Yeh. (2004). Task 1B: Gene list task BioCreAtIve workshop. In: BioCreative: Critical Assessment for Information Extraction in Biology.
  • L. Hirschman, A. Yeh, Christian Blaschke, and A. Valencia. (2005). Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6(Suppl 1):S1.
  • L. Hirschman, M. Krallinger, and A. Valencia, editors. (2007). Second BioCreative Challenge Evaluation Workshop. Fundación CNIO Carlos III, Madrid, Spain.
  • N. Karamanis, I. Lewin, R. Seal, R. Drysdale, and E. Briscoe. (2007). Integrating natural language processing with FlyBase curation. In: Proceedings of PSB, pages 245–256, Maui, Hawaii.
  • M. Krallinger, F. Leitner, and A. Valencia. (2007). Assessment of the second BioCreative PPI task: Automatic extraction of protein-protein interactions. In: Proceedings of the BioCreAtIvE II Workshop 2007, pages 41–54, Madrid, Spain.
  • Michael Krauthammer and G. Nenadic. (2004). Term identification in the biomedical literature. Journal of Biomedical Informatics (Special Issue on Named Entity Recognition in Biomedicine), 37(6):512–526.


2008 LearningSpeciesOfBiomedNEs
  Author: Xinglong Wang, Claire Grover
  title: Learning the Species of Biomedical Named Entities from Annotated Corpora
  journal: Proceedings of 6th Language Resources and Evaluation Conference
  titleUrl: http://www.ltg.ed.ac.uk/np/publications/ltg/papers/Wang2008Learning.pdf
  year: 2008