2007 nAryMultiSentencePPLRE


Subject Headings: Complex Semantic Relation Mention Recognition Algorithm, TeGRR.


Quotes

Abstract

Background

Research into semantic relation recognition from text has focused on the identification of binary relations that are contained within one sentence. In the domain of biomedical documents, however, relations of interest can have more than two arguments and can also have their entity mentions located in different sentences. An example of this scenario is the ternary relation of “subcellular localization”, which relates whether an organism’s (O) protein (P) has subcellular location (L) as one of its target destinations. Empirical evidence suggests that approximately one half of the mentions for this ternary relation reside in multi-sentence passages.

Results

We introduce a relation recognition algorithm that can detect n-ary relations across multiple sentences in a document, and use the subcellular localization relation as a motivating example. The approach uses a text-graph representation of the entire document that is based on intrasentential edges derived from each sentence’s predicted syntactic parse tree, and on intersentential edges based on either the linking of adjacent sentences or the linking of coreferents, if reliable coreference predictions are available. From the text graph, state-of-the-art features such as named-entity features and syntactic features are produced for each argument pairing. We test the approach on the task of recognizing, in PubMed abstracts, experimentally validated subcellular localization relations that have been curated by biomedical researchers. When tested against several baseline algorithms, our approach is shown to attain the highest F-measure.
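
The text-graph idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_text_graph` and `shortest_path`, the (sentence index, token index) node encoding, the simplified head-index form of each parse, the choice to link adjacent sentences through their root tokens, and the toy head indices are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def build_text_graph(sentences, coref_pairs=()):
    """Build an undirected graph over all tokens of a document.

    sentences: list of sentences, each a list of (token, head_index) pairs,
    where head_index is -1 for the single root token (assumed parse format).
    coref_pairs: optional pairs of (sentence, token) nodes predicted to corefer.
    """
    graph = defaultdict(set)
    roots = []
    for s, sent in enumerate(sentences):
        for t, (_, head) in enumerate(sent):
            if head < 0:
                roots.append((s, t))
            else:
                # Intrasentential edge taken from the sentence's parse.
                graph[(s, t)].add((s, head))
                graph[(s, head)].add((s, t))
    # Intersentential edges: link the roots of adjacent sentences (assumption).
    for a, b in zip(roots, roots[1:]):
        graph[a].add(b)
        graph[b].add(a)
    # Intersentential edges: link coreferent mentions when predictions exist.
    for a, b in coref_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    queue = deque([start])
    prev = {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in prev:
                prev[neighbour] = node
                queue.append(neighbour)
    return None


# Toy three-sentence document; the head indices are invented, not a real parse.
sentences = [
    [("pilus", 2), ("cholerae", 0), ("essential", -1)],
    [("apparatus", 1), ("composed", -1)],
    [("TcpC", 1), ("lipoprotein", -1), ("membrane", 1), ("pilus", 1)],
]
organism, protein = (0, 1), (2, 0)

adjacent_only = build_text_graph(sentences)
with_coref = build_text_graph(sentences, coref_pairs=[((0, 0), (2, 3))])

path_adjacent = shortest_path(adjacent_only, organism, protein)
path_coref = shortest_path(with_coref, organism, protein)
print(len(path_adjacent), len(path_coref))
```

In this toy document the organism and protein mentions are connected in both graphs, but the coreference edge between the two "pilus" mentions gives a shorter path than the route through the adjacent-sentence links. Path-based features for an argument pairing could then be read off such a path.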

Conclusions

We present a method that naturally supports the recognition of semantic relations with more than two arguments and whose mentions can reside across multiple sentences. The algorithm accelerated the extraction of experimentally validated subcellular localizations. Given that the corpus is based on abstracts, not copyrighted papers, the data is publicly available from koch.pathogenomics.ca/pplre/. Significant work remains to approximate human expert levels of performance. We hypothesize that additional features are required that provide contextual information from elsewhere in the document about whether the relation refers to an experimentally validated finding.

Background

Much of the world’s biomedical knowledge is contained in the natural language text within research papers that are increasingly becoming available online. Applications that have begun to tap this knowledge include information extraction and question answering algorithms, but these algorithms require effective approaches to recognize semantic relations between entity mentions. As has occurred in other natural language processing tasks, such as named entity recognition, approaches to relation recognition have evolved over time from knowledge engineering heuristic-based ones [2] to those that apply supervised machine learning algorithms to the task [1,5,8,10,11,13,14]. Early supervised learning approaches used bag-of-word representations of the document [14], then quickly proceeded to analyze shallow sequence representations [1], and most recently the emphasis has been on full syntactic parsing of each sentence [8,10,13].

While the performance of supervised relation detection has improved significantly since initial proposals [11], many more advances in the field are required before human levels of competency are attained. State-of-the-art performance on the protein/gene interaction task is currently 75% F-measure [5], but this performance was attained on binary relations, and the evaluation does not count the missed relations whose entity mentions reside in separate sentences. Research in NLP that has looked at information in multiple sentences has focused on coreference and, more specifically, on entity detection and tracking across sentences. However, such research has not yet been combined with state-of-the-art approaches to relation detection and their feature sets. As a motivating example, consider the following passage composed of three sentences: “The pilus LOCATION of V. cholerae ORGANISM is essential for intestinal colonization. The pilus LOCATION biogenesis apparatus is composed of at least nine proteins. TcpC PROTEIN is an outer membrane LOCATION lipoprotein required for pilus LOCATION biogenesis.” To our knowledge, no supervised relation recognition algorithm can currently identify the ternary relation between the organism in the first sentence and the subcellular location and protein in the third sentence, so this relation would go undetected by current information extraction algorithms.

Our work aims to address this scenario in order to improve the Recall and F-measure of relation recognition methods. We propose a framework that subsumes the representation used by state-of-the-art approaches when applied to the detection of binary relations within a single sentence. The framework is centered on a text-graph representation that includes intersentential edges. We illustrate that the generation of relation cases when dealing with multi-sentence passages can significantly increase the proportion of false relation cases from which to construct a classification model, but that our approach copes with this increase in negative cases.
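
The growth in candidate relation cases can be seen in a small sketch. It is an illustration under stated assumptions rather than the paper's procedure: the function name `generate_cases`, the `max_span` parameter, and the (text, role, sentence index) mention encoding are hypothetical, and the mentions are taken from the three-sentence example passage above.

```python
from itertools import product


def generate_cases(mentions, roles=("ORGANISM", "PROTEIN", "LOCATION"), max_span=0):
    """Enumerate candidate n-ary relation cases, one mention per role.

    mentions: list of (text, role, sentence_index) tuples.
    max_span: largest allowed distance, in sentences, between the first and
    last sentence used by a case (0 keeps only single-sentence cases).
    """
    by_role = [[m for m in mentions if m[1] == role] for role in roles]
    cases = []
    for combo in product(*by_role):
        sentence_ids = [m[2] for m in combo]
        if max(sentence_ids) - min(sentence_ids) <= max_span:
            cases.append(combo)
    return cases


# Entity mentions from the example passage, with their sentence indices.
mentions = [
    ("pilus", "LOCATION", 0),
    ("V. cholerae", "ORGANISM", 0),
    ("pilus", "LOCATION", 1),
    ("TcpC", "PROTEIN", 2),
    ("outer membrane", "LOCATION", 2),
    ("pilus", "LOCATION", 2),
]

single_sentence = generate_cases(mentions, max_span=0)
whole_passage = generate_cases(mentions, max_span=2)
print(len(single_sentence), len(whole_passage))
```

Restricted to single sentences, the passage yields no candidate case at all, because the organism and the protein never share a sentence. Allowing cases to span the whole passage produces four candidates, and every candidate that does not express the relation becomes a negative case for the classifier.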

The remainder of the paper is structured as follows. The next section defines the more general task of semantic relation detection considered in this paper and summarizes the current challenges that motivate further research into the topic. The text-graph framework and relation case generation are then described in detail, and a feature space that generalizes existing methods is introduced. Finally, we present the empirical results of experiments performed on the task of recognizing subcellular localizations within PubMed abstracts.

Conclusions

This paper addresses the challenge of recognizing mentions of relations with more than two arguments, where the arguments’ entity mentions can be located in different sentences. A motivating example is the ability to identify subcellular localization relations in biomedical research abstracts. For this ternary relation, a large proportion of relation cases extend beyond a single sentence. In general, the more arguments a semantic relation has, the more likely it is that the relation is spread beyond a single sentence. To support these more complex relation detection scenarios, we proposed a text-graph representation of the entire document. As in state-of-the-art supervised algorithms, the intrasentential graph edges are derived from each sentence’s syntactic parse tree. For intersentential edges, we propose linking adjacent sentences and also, if available, entities that are identified as coreferents by a coreference resolution process. Compared to three baseline algorithms, the proposed approach achieves competitive F-measure and Recall performance. The paper suggests several avenues of future research into the area of detecting n-ary relations across multiple sentences. We plan to explore the question of adding more contextual features. It will also be instructive to explore the challenges that arise when the framework is applied to full papers (rather than abstracts only), which will generate much larger and sparser text graphs.

References

  • 1. Agichtein E, Gravano L: Snowball: Extracting Relations from Large Plain-Text Collections. Procs. of the 5th ACM International Conference on Digital Libraries; 2000.
  • 2. Appelt DE, Hobbs JR, Bear J, Israel DJ, Tyson M: FASTUS: A Finite-state Processor for Information Extraction from Real-world Text. Procs. of IJCAI; 1993.
  • 3. Craven M, Kumlien J: Constructing biological knowledge-bases by extracting information from text sources. Procs. of the Seventh International Conference on Intelligent Systems for Molecular Biology; 1999.
  • 4. Freitag D, McCallum A: Information Extraction with HMMs and Shrinkage. AAAI'99 Workshop on Machine Learning for Information Extraction; 1999.
  • 5. Fundel K, Küffner R, Zimmer R: RelEx--relation extraction using dependency parse trees. Bioinformatics, 23(3):365-71; 2007.
  • 6. Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M, Brinkman FSL: PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, 21(5):617-23; 2005.
  • 7. Hoglund A, Blum T, Brady S, Donnes P, San Miguel J, Rocheford M, Kohlbacher O, Shatkay H: Significantly improved prediction of subcellular localization by integrating text and protein sequence data. Pacific Symposium on Biocomputing; 2006.
  • 8. Jiang J, Zhai C: A Systematic Exploration of the Feature Space for Relation Extraction. Procs. of NAACL/HLT; 2007.
  • 9. Kambhatla N: Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. Proceedings of ACL; 2004.
  • 10. Liu Y, Shi Z, Sarkar A: Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles. Procs. of NAACL/HLT; 2007.
  • 11. Miller S, Crystal M, Fox H, Ramshaw L, Schwartz R, Stone R, Weischedel R, the Annotation Group: Algorithms that learn to extract information BBN: Description of the SIFT system as used for MUC-7. Procs. of MUC-7; 1998.
  • 12. Rey S, Acab M, Gardy JL, Laird MR, deFays K, Lambert C, Brinkman FSL: PSORTdb: a protein subcellular localization database for bacteria. Nucleic Acids Research 33:D164-168; 2005.
  • 13. Shi Z, Sarkar A, Popowich F: Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. Procs. of NAACL/HLT; 2007.
  • 14. Stapley BJ, Benoit G: Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on Biocomputing; 2000.
  • 15. Stapley BJ, Kelley LA, Sternberg MJ: Predicting the sub-cellular location of proteins from text using support vector machines. Pacific Symposium on Biocomputing; 2002.
  • 16. Skounakis M, Craven M, Ray S: Hierarchical Hidden Markov Models for Information Extraction. Procs. of IJCAI; 2003.
  • 17. Zhang M, Zhang J, Su J: Exploring Syntactic Features for Relation Extraction using a Convolution Tree Kernel. Procs. of NAACL/HLT-2006; 2006.


Record: 2007 nAryMultiSentencePPLRE
Author: Martin Ester, Gabor Melli, Anoop Sarkar
Title: Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts
Journal: Proceedings of the 2nd International Symposium on Languages in Biology and Medicine
Title URL: http://ceur-ws.org/Vol-319/Paper2.pdf
Year: 2007