2007 nAryMultiSentencePPLRE

Much of the world’s biomedical knowledge is contained in the natural language text of research papers, which are increasingly available online. Applications that have begun to tap this knowledge include information extraction and question answering algorithms, but these algorithms require effective approaches to recognizing semantic relations between entity mentions. As has occurred in other natural language processing tasks, such as named entity recognition, approaches to relation recognition have evolved from heuristic-based knowledge engineering [2] to supervised machine learning [1,5,8,10,11,13,14]. Early supervised learning approaches used bag-of-words representations of the document [14], then quickly moved to shallow sequence representations [1]; most recently the emphasis has been on full syntactic parsing of each sentence [8,10,13].

While the performance of supervised relation detection has improved significantly since initial proposals [11], many more advances are required before human levels of competency are attained. State-of-the-art performance on the protein/gene interaction task is currently 75% F-measure [5], but this performance was attained on binary relations, and the evaluation does not count the relations missed because their entity mentions reside in separate sentences. NLP research that has looked at information in multiple sentences has focused on co-reference, and more specifically on entity detection and tracking across sentences. However, such research has not yet been combined with state-of-the-art approaches to relation detection and the features they use. As a motivating example, consider the following three-sentence passage, in which each entity mention is followed by its type in capitals: “The pilus LOCATION of V. cholerae ORGANISM is essential for intestinal colonization. The pilus LOCATION biogenesis apparatus is composed of at least nine proteins. TcpC PROTEIN is an outer membrane LOCATION lipoprotein required for pilus LOCATION biogenesis.” To our knowledge no supervised relation recognition algorithm can currently identify the ternary relation between the organism in the first sentence and the subcellular location and protein in the third sentence. This relation would therefore go undetected by current information extraction algorithms.

Our work aims to address this scenario in order to improve the recall and F-measure of relation recognition methods. We propose a framework that subsumes the representation used by state-of-the-art approaches when applied to the detection of binary relations within a single sentence. The framework is centered on a text-graph representation that includes intersentential edges. We show that generating relation cases from multi-sentence passages can significantly increase the proportion of false relation cases from which a classification model must be constructed, but that our approach copes with this increase in negative cases.
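To make the growth in relation cases concrete, the following is a minimal sketch and not the authors' implementation. It assumes a relation case is any (organism, protein, location) triple of entity mentions whose sentences lie within a chosen window; the `MENTIONS` list hand-encodes the motivating passage, and the names `relation_cases` and `max_span` are hypothetical.

```python
from itertools import product

# Hand-encoded mentions from the motivating passage:
# (sentence index, entity type, mention text). Illustrative data only.
MENTIONS = [
    (0, "LOCATION", "pilus"),
    (0, "ORGANISM", "V. cholerae"),
    (1, "LOCATION", "pilus"),
    (2, "PROTEIN", "TcpC"),
    (2, "LOCATION", "outer membrane"),
    (2, "LOCATION", "pilus"),
]


def relation_cases(mentions, max_span):
    """Enumerate (organism, protein, location) candidate triples.

    A triple is kept when its first and last sentences are at most
    max_span sentences apart (max_span=0 means single-sentence cases).
    """
    by_type = {}
    for mention in mentions:
        by_type.setdefault(mention[1], []).append(mention)

    cases = []
    for org, prot, loc in product(
        by_type.get("ORGANISM", []),
        by_type.get("PROTEIN", []),
        by_type.get("LOCATION", []),
    ):
        sentence_ids = [org[0], prot[0], loc[0]]
        if max(sentence_ids) - min(sentence_ids) <= max_span:
            cases.append((org, prot, loc))
    return cases


if __name__ == "__main__":
    for span in (0, 1, 2):
        print(span, len(relation_cases(MENTIONS, span)))
```

Under these assumptions, restricting cases to a single sentence yields no candidate at all for this passage, since the organism and the protein never share a sentence. Widening the window to the whole passage yields one candidate per location mention, and every candidate that does not express the relation becomes a negative case for the classifier.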

The remainder of the paper is structured as follows. The next section defines the more general task of semantic relation detection considered in this paper and summarizes the current challenges that motivate further research into the topic. The text-graph framework and relation case generation are then described in detail, and the feature space that generalizes existing methods is introduced. Finally, we present the empirical results of experiments performed on the task of recognizing subcellular localizations within PubMed abstracts.

…

Conclusions

This paper addresses the challenge of recognizing mentions of relations with more than two arguments, where the arguments’ entity mentions can be located in different sentences. A motivating example is the ability to identify subcellular localization relations in biomedical research abstracts. For this ternary relation a large proportion of relation cases extend beyond a single sentence. In general, the more arguments a semantic relation has, the more likely it is that the relation is spread beyond a single sentence. To support these more complex relation detection scenarios we proposed a text-graph representation of the entire document. As in state-of-the-art supervised algorithms, the intrasentential graph edges are derived from each sentence’s syntactic parse tree. For intersentential edges we propose linking adjacent sentences and also, if available, linking entities that are identified as coreferents by a coreference resolution process. Compared to three baseline algorithms, the proposed approach achieves competitive F-measure and recall performance.

The paper suggests several avenues of future research into detecting n-ary relations across multiple sentences. We plan to explore the question of adding more contextual features. It will also be instructive to explore the challenges that arise when the framework is applied to full papers (rather than abstracts only), which will generate much larger and sparser text graphs.
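The text-graph representation summarized above can be illustrated with a small sketch. It is not the authors' implementation and rests on several assumptions: tokens are nodes, intrasentential edges come from a hand-written head index per token (standing in for a real syntactic parse), each sentence has exactly one root, adjacent sentences are linked root to root, and coreference links are supplied by hand. The names `build_text_graph` and `shortest_path` are hypothetical.

```python
from collections import deque


def build_text_graph(sentences, coref_pairs=()):
    """Build an undirected graph over (sentence_idx, token_idx) nodes.

    sentences: list of sentences, each a list of (token, head_idx) pairs,
               with head_idx == -1 marking the single root of the sentence.
    coref_pairs: pairs of nodes assumed to be coreferent.
    """
    graph = {}

    def add_edge(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    roots = []
    for s, sentence in enumerate(sentences):
        for t, (_token, head) in enumerate(sentence):
            graph.setdefault((s, t), set())
            if head == -1:
                roots.append((s, t))
            else:
                add_edge((s, t), (s, head))  # intrasentential edge

    # Assumption: adjacent sentences are linked through their roots.
    for left, right in zip(roots, roots[1:]):
        add_edge(left, right)

    # Optional intersentential edges between coreferent mentions.
    for a, b in coref_pairs:
        add_edge(a, b)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Toy two-sentence input with hand-written heads (not parser output).
SENTENCES = [
    [("pilus", 1), ("is", -1), ("essential", 1)],
    [("TcpC", 1), ("is", -1), ("required", 1), ("for", 2), ("pilus", 3)],
]
COREF = [((0, 0), (1, 4))]  # the two "pilus" mentions

if __name__ == "__main__":
    graph = build_text_graph(SENTENCES, COREF)
    print(shortest_path(graph, (0, 0), (1, 0)))
```

In this toy graph the two sentences are connected only through the root-to-root edge and the coreference edge, so any path between mentions in different sentences must use one of them. A path of this kind between entity mentions is one plausible source of features for a relation case, analogous to the parse-tree paths used within a single sentence.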

References

1. Agichtein E, Gravano L: Snowball: Extracting Relations from Large Plain-Text Collections. Procs. of the 5th ACM International Conference on Digital Libraries; 2000.
2. Appelt DE, Hobbs JR, Bear J, Israel DJ, Tyson M: FASTUS: A Finite-state Processor for Information Extraction from Real-world Text. Procs. of IJCAI; 1993.
3. Craven M, Kumlien J: Constructing biological knowledge-bases by extracting information from text sources. Procs. of the Seventh International Conference on Intelligent Systems for Molecular Biology; 1999.
4. Freitag D, McCallum A: Information Extraction with HMMs and Shrinkage. AAAI'99 Workshop on Machine Learning for Information Extraction; 1999.
5. Fundel K, Küffner R, Zimmer R: RelEx--relation extraction using dependency parse trees. Bioinformatics, 23(3):365-71; 2007.
6. Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M, Brinkman FSL: PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics, 21(5):617-23; 2005.
7. Hoglund A, Blum T, Brady S, Donnes P, San Miguel J, Rocheford M, Kohlbacher O, Shatkay H: Significantly improved prediction of subcellular localization by integrating text and protein sequence data. Pacific Symposium on Biocomputing; 2006.
8. Jiang J, Zhai C: A Systematic Exploration of the Feature Space for Relation Extraction. Procs. of NAACL/HLT; 2007.
9. Kambhatla N: Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. Proceedings of ACL; 2004.
10. Liu Y, Shi Z, Sarkar A: Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles. Procs. of NAACL/HLT; 2007.
11. Miller S, Crystal M, Fox H, Ramshaw L, Schwartz R, Stone R, Weischedel R, the Annotation Group: Algorithms that learn to extract information. BBN: Description of the SIFT system as used for MUC-7. Procs. of MUC-7; 1998.
12. Rey S, Acab M, Gardy JL, Laird MR, deFays K, Lambert C, Brinkman FSL: PSORTdb: a protein subcellular localization database for bacteria. Nucleic Acids Research, 33:D164-168; 2005.
13. Shi Z, Sarkar A, Popowich F: Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. Procs. of NAACL/HLT; 2007.
14. Stapley BJ, Benoit G: Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on Biocomputing; 2000.
15. Stapley BJ, Kelley LA, Sternberg MJ: Predicting the sub-cellular location of proteins from text using support vector machines. Pacific Symposium on Biocomputing; 2002.
16. Skounakis M, Craven M, Ray S: Hierarchical Hidden Markov Models for Information Extraction. Procs. of IJCAI; 2003.
17. Zhang M, Zhang J, Su J: Exploring Syntactic Features for Relation Extraction using a Convolution Tree Kernel. Procs. of NAACL/HLT-2006; 2006.

Authors: Martin Ester, Gabor Melli, Anoop Sarkar
Title: Recognition of Multi-sentence n-ary Subcellular Localization Mentions in Biomedical Abstracts
Venue: Proceedings of the 2nd International Symposium on Languages in Biology and Medicine
URL: http://ceur-ws.org/Vol-319/Paper2.pdf
Year: 2007
