2007 ExploitingRichSyntInfoForRelExt

From GM-RKB

Subject Headings: Relation Recognition from Text Task, Semantic Role Labeling Application

Notes

Cited By

  • ~20

Quotes

Abstract

This paper proposes a ternary relation extraction method primarily based on rich syntactic information. We identify PROTEIN-ORGANISM-LOCATION relations in the text of biomedical articles. Different kernel functions are used with an SVM learner to integrate two sources of information from syntactic parse trees: (i) a large number of syntactic features that have been shown useful for Semantic Role Labeling (SRL) and applied here to the relation extraction task, and (ii) features from the entire parse tree using a tree kernel. Our experiments show that the use of rich syntactic features significantly outperforms shallow word-based features. The best accuracy is obtained by combining SRL features with tree kernels.

2 SRL Features for Information Extraction

We take the PROTEIN name in the role of the predicate (verb) and the ORGANISM/LOCATION names as its argument candidates. The problem of identifying the binary relations PO (PROTEIN-ORGANISM) and PL (PROTEIN-LOCATION) is thereby reduced to argument classification, given the predicate and the argument candidates. We pick PROTEIN names as predicates because we assume that PROTEIN names play a more central role in linking the binary relations to the final ternary relations.
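
The reduction described above can be sketched as a candidate-generation step: each PROTEIN acts as the predicate and is paired with every ORGANISM or LOCATION mention as an argument candidate. This is only an illustration; the `(text, type)` input format, the function name `argument_candidates`, and the example entities are assumptions, not taken from the paper.

```python
from itertools import product

def argument_candidates(entities):
    """Pair every PROTEIN (predicate) with every ORGANISM/LOCATION (argument candidate).

    `entities` is a list of (text, type) tuples; this input format is an assumption.
    Each returned triple is (relation label, protein text, argument text).
    """
    proteins = [e for e in entities if e[1] == "PROTEIN"]
    arguments = [e for e in entities if e[1] in ("ORGANISM", "LOCATION")]
    pairs = []
    for pro, arg in product(proteins, arguments):
        relation = "PO" if arg[1] == "ORGANISM" else "PL"
        pairs.append((relation, pro[0], arg[0]))
    return pairs

# Made-up example sentence entities, for illustration only.
entities = [("PhoA", "PROTEIN"), ("E. coli", "ORGANISM"), ("periplasm", "LOCATION")]
candidates = argument_candidates(entities)
```

Each candidate pair would then be passed to a classifier that decides whether the PO or PL relation actually holds.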

3 System Description

The Syntactic Annotator parses the sentences and inserts the head information to the parse trees by using the Magerman/Collins head percolation rules.
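
Head percolation can be illustrated with a small sketch. The rule table below is a hypothetical, heavily simplified stand-in for the Magerman/Collins rules (which cover many more categories), and the nested-tuple tree format is an assumption made for this example.

```python
# Hypothetical, heavily simplified head rules: for each parent label,
# a search direction and a priority list of child labels.
HEAD_RULES = {
    "S": ("right", ["VP", "S"]),
    "VP": ("left", ["VBZ", "VBD", "VBP", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
    "PP": ("left", ["IN", "TO"]),
}

def head_word(tree):
    """Return (head word, head POS) for a tree.

    A tree is either a preterminal (pos, word) or a phrase (label, [children]).
    """
    label, rest = tree
    if isinstance(rest, str):  # preterminal: the word is its own head
        return rest, label
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    children = rest if direction == "left" else list(reversed(rest))
    for wanted in priorities:
        for child in children:
            if child[0] == wanted:
                return head_word(child)
    # Fall back to the first child in the search direction.
    return head_word(children[0])

# Made-up example parse, for illustration only.
tree = ("S", [
    ("NP", [("NNP", "PhoA")]),
    ("VP", [
        ("VBZ", "localizes"),
        ("PP", [("IN", "to"), ("NP", [("DT", "the"), ("NN", "periplasm")])]),
    ]),
])
sentence_head = head_word(tree)
```

In a full annotator the head found at each node would be stored on the node itself, so that later features such as "head word of the parent node" become simple lookups.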

Table 1: Features adopted from the SRL task. PRO: PROTEIN; ORG: ORGANISM

  • each word and its Part-of-Speech (POS) tag of PRO name
  • head word (hw) and its POS of PRO name
  • subcategorization that records the immediate structure that expands from PRO name. Non-PRO daughters will be eliminated
  • POS of parent node of PRO name
  • head word and its POS of the parent node of PRO name
  • each word and its POS of ORG name (in the case of “PO” relation extraction).
  • head word and its POS of ORG name
  • POS of parent node of ORG name
  • head word and its POS of the parent node of ORG name
  • POS of the word immediately before/after ORG name
  • punctuation immediately before/after ORG name
  • feature combinations:
    • head word of PRO name, hw of ORG name
    • hw of PRO name, POS of head word of ORG name
    • POS of hw of PRO name, POS of head word of ORG name
  • path from PRO name to ORG name and the length of the path
  • trigrams of the path. We consider up to 9 trigrams
  • lowest common ancestor node of PRO name and ORG name along the path
  • LCA (Lowest Common Ancestor) path that is from ORG name to its lowest common ancestor with PRO name
  • relative position of PRO name and ORG name. In parse trees, we consider 4 types of positions that ORGs are relative to PROs: before, after, inside, other
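
The path-based features in Table 1 (path, path length, path trigrams, lowest common ancestor, LCA path) can be sketched as follows. The `Node` class, the choice to measure path length in nodes, and the example tree are assumptions for illustration; the paper's exact encoding may differ.

```python
class Node:
    """Minimal parse-tree node with a parent pointer (hypothetical helper)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(node):
    """Return the chain node, parent, ..., root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def path_features(pro, org, max_trigrams=9):
    """Compute path features between the PRO node and the ORG node."""
    up = ancestors(pro)
    down = ancestors(org)
    down_ids = {id(n) for n in down}
    # First ancestor of PRO that is also an ancestor of ORG.
    lca = next(n for n in up if id(n) in down_ids)
    up_labels = [n.label for n in up[:up.index(lca)]]
    down_labels = [n.label for n in down[:down.index(lca)]]
    path = up_labels + [lca.label] + list(reversed(down_labels))
    trigrams = [tuple(path[i:i + 3]) for i in range(len(path) - 2)]
    return {
        "path": path,
        "path_length": len(path),            # counted in nodes (an assumption)
        "trigrams": trigrams[:max_trigrams],
        "lca": lca.label,
        "lca_path": down_labels + [lca.label],  # from ORG up to the LCA
    }

# Made-up tree: (S (NP:PRO) (VP (VBZ) (PP (IN) (NP:ORG))))
pro = Node("NP")
org = Node("NP")
root = Node("S", [pro, Node("VP", [Node("VBZ"), Node("PP", [Node("IN"), org])])])
features = path_features(pro, org)
```

The same traversal also yields the path restricted to PROTEIN-to-argument nodes that a path-based kernel would consume.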

Table 2: New features used in the SRL-based relation extraction system.

  • subcategorization that records the immediate structure that expands from ORG name. Non-ORG daughters will be eliminated
  • whether there is a VP node along the path as an ancestor of ORG name
  • whether there is a VP node as a sibling of ORG name
  • path from PRO name to LCA and the path length (L1)
  • path from ORG name to LCA and the path length (L2)
  • combination of L1 and L2
  • sibling relation of PRO and ORG
  • distance between PRO name and ORG name in the sentence. (3 valued: 0 if nw (number of words) = 0; 1 if 0 < nw <= 5; 2 if nw > 5)
  • combination of distance and sibling relation
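
The three-valued distance feature and its combination with the sibling relation can be sketched as below. The token-span representation `(start, end)` with an exclusive end and the string format of the combined feature are assumptions made for this example.

```python
def words_between(pro_span, org_span):
    """Number of tokens strictly between two (start, end) token spans, end exclusive."""
    first, second = sorted([pro_span, org_span])
    return max(0, second[0] - first[1])

def distance_bucket(nw):
    """Map the number of intervening words to the 3-valued feature from Table 2."""
    if nw == 0:
        return 0
    if nw <= 5:
        return 1
    return 2

def distance_sibling_feature(pro_span, org_span, are_siblings):
    """Combine the distance bucket with the sibling relation into one feature string."""
    bucket = distance_bucket(words_between(pro_span, org_span))
    return f"dist={bucket}|sib={int(are_siblings)}"

# PRO covers token 0, ORG covers tokens 7-8, so six words lie between them.
example_feature = distance_sibling_feature((0, 1), (7, 9), are_siblings=False)
```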

4 Experiments and Evaluation

Baseline1 is a purely word-based system, where the features consist of the unigrams and bigrams between the PROTEIN name and the ORGANISM/LOCATION names inclusively, where the stopwords are selectively eliminated.
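
A word-based feature extractor of this kind might look like the sketch below. The stopword list is a hypothetical placeholder (the quote does not say which stopwords are removed), and the token-span input format is an assumption.

```python
# Hypothetical stopword list; the actual selection is not given in the quote.
STOPWORDS = {"the", "a", "an", "of"}

def baseline1_features(tokens, pro_span, arg_span):
    """Unigrams and bigrams over the window spanning both names, inclusively.

    Spans are (start, end) token offsets with an exclusive end.
    """
    start = min(pro_span[0], arg_span[0])
    end = max(pro_span[1], arg_span[1])
    window = [t.lower() for t in tokens[start:end] if t.lower() not in STOPWORDS]
    unigrams = list(window)
    bigrams = [f"{a}_{b}" for a, b in zip(window, window[1:])]
    return unigrams + bigrams

# Made-up example sentence, for illustration only.
tokens = ["PhoA", "is", "exported", "to", "the", "periplasm", "of", "E.", "coli"]
word_features = baseline1_features(tokens, pro_span=(0, 1), arg_span=(7, 9))
```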

Baseline2 is a naive approach that assumes that any example containing PROTEIN, LOCATION names has the PL relation. The same assumption is made for PO and POL relations.
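
The co-occurrence assumption behind Baseline2 amounts to a set-inclusion check, as in this minimal sketch (the function and table names are hypothetical):

```python
# Entity types that must all be present for each relation.
REQUIRED_TYPES = {
    "PO": {"PROTEIN", "ORGANISM"},
    "PL": {"PROTEIN", "LOCATION"},
    "POL": {"PROTEIN", "ORGANISM", "LOCATION"},
}

def baseline2_predict(entity_types, relation):
    """Predict that the relation holds whenever all required entity types co-occur."""
    return REQUIRED_TYPES[relation] <= set(entity_types)

has_pl = baseline2_predict(["PROTEIN", "LOCATION"], "PL")
has_pol = baseline2_predict(["PROTEIN", "LOCATION"], "POL")
```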

PAK system uses predicate-argument structure kernel (PAK) based method. PAK was defined in (Moschitti, 2004) and only considers the path from the predicate to the target argument, which in our setting is the path from the PROTEIN to the ORGANISM or LOCATION names.

SRL is an SRL system which is adapted to use our new feature set. A default linear kernel is applied with SVM learning.

TRK system is similar to PAK system except that the input is an entire parse tree instead of a PAK path.

TRK+SRL combines full parse trees and manually extracted features and uses the kernel combination.
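
One common way to combine a tree kernel with a feature-based kernel is a weighted sum of normalized kernels. The sketch below assumes that form; the quotes here do not state the exact combination used. The tree kernel is a Collins-Duffy style subset-tree kernel used as a stand-in, and `gamma`, `lam`, and the example data are hypothetical.

```python
import math

def nodes(tree):
    """All internal nodes of a tree given as (label, [children]); leaves are strings."""
    out = [tree]
    for child in tree[1]:
        if not isinstance(child, str):
            out.extend(nodes(child))
    return out

def production(node):
    """The grammar production at a node, distinguishing words from nonterminals."""
    label, children = node
    rhs = tuple(("w", c) if isinstance(c, str) else ("n", c[0]) for c in children)
    return label, rhs

def delta(n1, n2, lam):
    """Weighted count of common tree fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1], n2[1]):
        if not isinstance(c1, str):
            score *= 1.0 + delta(c1, c2, lam)
    return score

def tree_kernel(t1, t2, lam=1.0):
    """Sum delta over all node pairs (subset-tree kernel)."""
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))

def linear_kernel(x, y):
    """Dot product of sparse feature dicts."""
    return sum(v * y.get(k, 0.0) for k, v in x.items())

def normalized(kernel, a, b):
    """Cosine-style normalization so each kernel contributes on the same scale."""
    denom = math.sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0

def combined_kernel(ex1, ex2, gamma=1.0):
    """Tree kernel plus gamma times the linear kernel over hand-built features."""
    tree_part = normalized(tree_kernel, ex1["tree"], ex2["tree"])
    feat_part = normalized(linear_kernel, ex1["feats"], ex2["feats"])
    return tree_part + gamma * feat_part

# Made-up example with a tiny tree and two binary features.
example = {
    "tree": ("NP", [("DT", ["the"]), ("NN", ["cell"])]),
    "feats": {"hw=cell": 1.0, "pos=NN": 1.0},
}
self_similarity = combined_kernel(example, example)
```

A sum of valid kernels is itself a valid kernel, so the combined function can be supplied to any SVM implementation that accepts a precomputed Gram matrix.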

5 Conclusion

In this paper we explored the use of rich syntactic features for the relation extraction task. In contrast with the previously used set of syntactic features for this task, we use a large number of features originally proposed for the Semantic Role Labeling task. We provide comprehensive experiments using many different models that use features from parse trees. Combining SRL features with tree kernels over the entire tree obtains 71.8% accuracy, significantly outperforming shallow word-based features, which obtain 56.3% accuracy.

References

  • Christian Blaschke, M. Andrade, C. Ouzounis, and A. Valencia. (1999). Automatic extraction of biological information from scientific text: Protein-protein interactions. In AAAI-ISMB 1999.
  • Razvan C. Bunescu and Raymond Mooney. (2005). A shortest path dependency kernel for relation extraction. In: Proceedings of HLT/EMNLP-2005.
  • Claudio Giuliano, A. Lavelli, and L. Romano. (2006). Exploiting Shallow Linguistic Information for Relation Extraction from Biomedical Literature. In: Proceedings of EACL 2006.
  • Aron Culotta and J. Sorensen. (2004). Dependency tree kernels for relation extraction. In: Proceedings of ACL 2004.
  • Ronen Feldman, Y. Regev, M. Finkelstein-Landau, E. Hurvitz, and B. Kogan. (2002). Mining biomedical literature using information extraction. Current Drug Discovery.
  • Nanda Kambhatla. (2004). Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In: Proceedings of ACL 2004 (poster session).
  • S. Miller, H. Fox, L. Ramshaw, and R. Weischedel. (2000). A novel use of statistical parsing to extract information from text. Proceedings of NAACL-2000.
  • Alessandro Moschitti. (2004). A study on convolution kernels for shallow semantic parsing. In: Proceedings of ACL 2004.
  • T. Sekimizu, H.S. Park, and Jun'ichi Tsujii. (1998). Identifying the interaction between genes and gene products based on frequently seen verbs in medline abstracts. In Genome Informatics. 62-71.
  • Z. Shi, Anoop Sarkar, and F. Popowich. (2007). Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques. In NAACL-HLT 2007 (short paper).
  • D. Zelenko, C. Aone, and A. Richardella. (2003). Kernel methods for relation extraction. Journal of Machine Learning Research.
  • M. Zhang, J. Zhang, J. Su, and G.D. Zhou. (2006). A Composite Kernel to Extract Relations between Entities with Both Flat and Structured Features. In: Proceedings of ACL-2006.


Page: 2007 ExploitingRichSyntInfoForRelExt
Author: Zhongmin Shi, Yudong Liu, Anoop Sarkar
Title: Exploiting Rich Syntactic Information for Relation Extraction from Biomedical Articles
URL: http://acl.ldc.upenn.edu/N/N07/N07-2025.pdf