Keywords: Relation Detection from Text Algorithm, ACE Benchmark Task
Cited By
- [ZhangZS, 2006] => M. Zhang, J. Zhang, and J. Su. (2006). Exploring Syntactic Features for Relation Extraction using a Convolution Tree Kernel. In Proceedings of HLT-2006.
- "Zhao and Grishman (2005) define a feature based composite kernel to integrate diverse features. Their kernel displays very good performance on the 2004 version of ACE corpus. Since this is a feature-based kernel, all the features used in the kernel have to be explicitly enumerated. Similar with the feature-based method, they also represent the tree feature as a link path between two entities. Therefore, we wonder whether their performance improvement is mainly due to the explicitly incorporation of diverse linguistic features instead of the kernel method itself."
Quotes
Abstract
- "Entity relation detection is a form of information extraction that finds predefined relations between pairs of entities in text. This paper describes a relation detection approach that combines clues from different levels of syntactic processing using kernel methods. Information from three different levels of processing is considered: tokenization, sentence parsing and deep dependency analysis. Each source of information is represented by kernel functions. Then composite kernels are developed to integrate and extend individual kernels so that processing errors occurring at one level can be overcome by information from other levels. We present an evaluation of these methods on the 2004 ACE relation detection task, using Support Vector Machines, and show that each level of syntactic processing contributes useful information for this task. When evaluated on the official test data, our approach produced very competitive ACE value scores. We also compare the SVM with KNN on different kernels."
2 Kernel Methods
- "Second, kernel functions have many nice combination properties: for example, the sum or product of existing kernels is a valid kernel. This forms the basis for the approach described in this paper. With these combination properties, we can combine individual kernels representing information from different sources in a principled way.
- "Many classifiers can be used with kernels. The most popular ones are SVM, KNN, and voted perceptrons.
- "Given all the levels of features incorporated in kernels and training data with target examples labeled, an SVM can pick up the features that best separate the targets from other examples, no matter which level these features are from. In cases where an error occurs in one processing result (especially deep processing) and the features related to it become noisy, SVM may pick up clues from other sources which are not so noisy.
3. Related Work
- "Collins et al. (1997) and Miller et al. (2000) used statistical parsing models to extract relational facts from text, which avoided pipeline processing of data. However, their results are essentially based on the output of sentence parsing, which is a deep processing of text. So their approaches are vulnerable to errors in parsing. Collins et al. (1997) addressed a simplified task within a confined context in a target sentence.
- "Zelenko et al. (2003) described a recursive kernel based on shallow parse trees to detect personaffiliation and organization-location relations, in which a relation example is the least common subtree containing two entity nodes. The kernel matches nodes starting from the roots of two subtrees and going recursively to the leaves. For each pair of nodes, a subsequence kernel on their child nodes is invoked, which matches either contiguous or non-contiguous subsequences of node. Compared with full parsing, shallow parsing is more reliable. But this model is based solely on the output of shallow parsing so it is still vulnerable to irrecoverable parsing errors. In their experiments, incorrectly parsed sentences were eliminated.
- "Culotta and Sorensen (2004) described a slightly generalized version of this kernel based on dependency trees. Since their kernel is a recursive match from the root of a dependency tree down to the leaves where the entity nodes reside, a successful match of two relation examples requires their entity nodes to be at the same depth of the tree. This is a strong constraint on the matching of syntax so it is not surprising that the model has good precision but very low recall. In their solution a bag-of-words kernel was used to compensate for this problem. In our approach, more flexible kernels are used to capture regularization in syntax, and more levels of syntactic information are considered.
- "Kambhatla (2004) described a Maximum Entropy model using features from various syntactic sources, but the number of features they used is limited and the selection of features has to be a manual process.
- "In our model, we use kernels to incorporate more syntactic information and let a Support Vector Machine decide which clue is crucial. Some of the kernels are extended to generate high order features. We think a discriminative classifier trained with all the available syntactic features should do better on the sparse data.
4 Kernel Relation Detection
4.1 ACE Relation Detection Task
- "ACE (Automatic Content Extraction)2 is a research and development program in information extraction sponsored by the U.S. Government.
- "The 2004 evaluation defined seven major types of relations between seven types of entities. The entity types are PER (Person), ORG (Organization), FAC (Facility), GPE (Geo-Political Entity: countries, cities, etc.), LOC (Location), WEA (Weapon) and VEH (Vehicle).
- "Each mention of an entity has a mention type: NAM (proper name), NOM (nominal) or PRO (pronoun); for example George W. Bush, the president, and he respectively.
- "The seven relation types are EMP-ORG (Employment/ Membership/Subsidiary), PHYS (Physical), PER-SOC (Personal/Social), GPE-AFF (GPEAffiliation), Other-AFF (Person/ORG Affiliation), ART (Agent-Artifact) and DISC (Discourse).
- "There are also 27 relation subtypes defined by ACE, but this paper only focuses on detection of
relation types.
4.2 Definitions
- In our model, kernels incorporate information from tokenization, parsing and deep dependency analysis.
- A relation candidate R is defined as R = (arg1, arg2, seq, link, path), where:
- "arg1 and arg2 are the two entity arguments which may be related;
- "seq=(t1, t2, …, tn) is a token vector that covers the arguments and intervening words;
- "link=(t1, t2, …, tm) is also a token vector, generated from seq and the parse tree;
- "path is a dependency path connecting arg1 and arg2 in the dependency graph produced by GLARF. path can be empty if no such dependency path exists.
- The difference between link and seq is that link only retains the “important” words in seq in terms of syntax. For example, all noun phrases occurring in seq are replaced by their heads. Words and constituent types in a stop list, such as time expressions, are also removed.
- "A token T is defined as a string triple, T = (word, pos, base), where word, pos and base are strings representing the word, part-of-speech and morphological base form of T. Entity is a token augmented with other attributes,
- E = (tk, type, subtype, mtype), where:
- tk is the token associated with E;
- type, subtype and mtype are strings representing the entity type, subtype and mention type of E. The subtype contains more specific information about an entity. For example, for a GPE entity, the subtype tells whether it is a country name, city name and so on. Mention type includes NAM, NOM and PRO.
- "It is worth pointing out that we always treat an entity as a single token: for a nominal, it refers to its head, such as boys in the two boys; for a proper name, all the words are connected into one token, such as Bashar_Assad. So in a relation example R whose seq is (t1, t2, …, tn), it is always true that arg1=t1 and arg2=tn. For names, the base form of an entity is its ACE type (person, organization, etc.). To introduce dependencies, we define a dependency token to be a token augmented with a vector of dependency arcs,
- DT=(word, pos, base, dseq), where dseq = (arc1, ... , arcn ). A dependency arc is
- "ARC = (w, dw, label, e), where w is the current token; dw is a token connected by a dependency to w; and label and e are the role label and direction of this dependency arc respectively. From now on we upgrade the type of tk in arg1 and arg2 to be dependency tokens. Finally, path is a vector of dependency arcs, path = (arc1 , ... , arcl ), where l is the length of the path and arci (1≤i≤l) satisfies arc1.w=arg1.tk, arci+1.w=arci.dw and arcl.dw=arg2.tk. So path is a chain of dependencies connecting the two arguments in R. The arcs in it do not have to be in the same direction.
4.3 Syntactic Kernels
- "In this section we will describe the kernels designed for different syntactic sources and explain the intuition behind them.
- "We define two kernels to match relation examples at surface level. Using the notation just defined, we can write the two surface kernels as follows:
1) Argument kernel
- "Kernel Ψ1 matches attributes of two entity arguments respectively, such as type, subtype and lexical head of an entity. This is based on the observation that there are type constraints on the two arguments. For instance PER-SOC is a relation mostly between two person entities. So the attributes of the entities are crucial clues.
2) Bigram kernel
- "So Ψ2 is a kernel that simply matches unigrams and bigrams between the seq sequences of two relation examples. The information this kernel provides is faithful to the text.
3) Link sequence kernel
- "Ψ3 is a kernel that matches token by token between the link sequences of two relation examples. Since relations often occur in a short context, we expect many of them have similar link sequences.
4) Dependency-path Kernel
- "Intuitively the dependency path connecting two arguments could provide a high level of syntactic regularization. However, a complete match of two dependency paths is rare. So this kernel matches the component arcs in two dependency paths in a pairwise fashion. Two arcs can match only when they are in the same direction. In cases where two paths do not match exactly, this kernel can still tell us how similar they are. In our experiments we placed an upper bound on the length of dependency paths for which we computed a non-zero kernel.
5) Local-dependency Kernel
- "This kernel matches the local dependency context around the relation arguments. This can be helpful especially when the dependency path between arguments does not exist. We also hope the dependencies on each argument may provide some useful clues about the entity or connection of the entity to the context outside of the relation example.
4.4 Compositive Kernels
- "All the individual kernels we designed are explicit. Each kernel can be seen as a matching of features and these features are enumerable on the given data. So it is clear that they are all valid kernels. Since the kernel function set is closed under linear combination and polynomial extension, the composite kernels are also valid. The reason we propose to use a feature-based kernel is that we can have a clear idea of what syntactic clues it represents and what kind of information it misses. This is important when developing or refining kernels, so that we can make them generate complementary information from different syntactic processing results.
6. Further Work
- "In another direction, training data is often sparse for IE tasks. String matching is not sufficient to capture semantic similarity of words. One solution is to use general purpose corpora to create clusters of similar words; another option is to use available resources like WordNet. These word similarities can be readily incorporated into the kernel framework.
- "To deal with sparse data, we can also use deeper text analysis to capture more regularities from the data. Such analysis may be based on newly-annotated corpora like PropBank (Kingsbury and Palmer, 2002) at the University of Pennsylvania and NomBank (Meyers et al., 2004) at New York University. Analyzers based on these resources can generate regularized semantic representations for lexically or syntactically related sentence structures. Although deeper analysis may even be less accurate, our framework is designed to handle this and still obtain some improvement in performance.
References
- M. Collins and S. Miller. 1997. Semantic tagging using a probabilistic context free grammar. In Proceedings of the 6th Workshop on Very Large Corpora.
- N. Cristianini and J. Shawe-Taylor. 2000. An introduction to support vector machines. Cambridge University Press.
- [Culotta and Sorensen, 2004] => A. Culotta and J. S. Sorensen. (2004). Dependency Tree Kernels for Relation Extraction. In Proc. of ACL-2004.
- D. Gildea and M. Palmer. 2002. The Necessity of Parsing for Predicate Argument Recognition. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- [Kambhatla, 2004] => N. Kambhatla. (2004). Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. Poster. In Proc. ACL-2004
- P. Kingsbury and M. Palmer. 2002. From treebank to propbank. In Proceedings of the 3rd International
Conference on Language Resources and Evaluation (LREC-2002). - C. D. Manning and H. Schutze 2002. Foundations of Statistical Natural Language Processing. The MIT Press, page 454-455.
- A. Meyers, R. Grishman, M. Kosaka and S. Zhao. 2001. Covering Treebanks with GLARF. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.
- A. Meyers, R. Reeves, Catherine Macleod, Rachel Szekeley, Veronkia Zielinska, Brian Young, and R. Grishman. 2004. The Cross-Breeding of Dictionaries. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2004).
- [MillerFRW, 2000] => S. Miller, H. Fox, L. Ramshaw, and R. Weischedel. (2000). http://portal.acm.org/citation.cfm?id=974335">A novel use of statistical parsing to extract information from text. In Proc. NAACL-2000.
- K.-R. Müller, S. Mika, G. Ratsch, K. Tsuda and B. Scholkopf. 2001. An introduction to kernel-based learning algorithms, IEEE Trans. Neural Networks, 12, 2, pages 181-201.
- V. N. Vapnik. 1998. Statistical Learning Theory. Wiley-Interscience Publication.
- D. Zelenko, C. Aone and A. Richardella. 2003. Kernel methods for relation extraction. Journal of Machine
Learning Research. - Shubin Zhao, Adam Meyers, Ralph Grishman. 2004. Discriminative Slot Detection Using Kernel Methods. In the Proceedings of the 20th International