Keywords: Relation Recognition from Text Algorithm, Relational Data Kernel Function
Cited By
- [ZhangZS, 2006] => M. Zhang, J. Zhang, and J. Su. (2006). Exploring Syntactic Features for Relation Extraction using a Convolution Tree Kernel. In Proceedings of HLT-2006.
- "Zelenko et al. (2003) develop a tree kernel for relation extraction. Their tree kernel is recursively defined in a top-down manner, matching nodes from roots to leaf nodes. For each pair of matching nodes, a subsequence kernel on their child nodes is invoked, which matches either contiguous or sparse subsequences of node.
Quotes
Abstract
"We present an application of kernel methods to extracting relations from unstructured natural language sources. We introduce kernels defined over shallow parse representations of text, and design efficient algorithms for computing the kernels. We use the devised kernels in conjunction with Support Vector Machine and Voted Perceptron learning algorithms for the task of extracting person-affiliation and organization-location relations from text. We experimentally evaluate the proposed methods and compare them with feature-based learning algorithms, with promising results.
1. Introduction
- "Information extraction is an important unsolved problem of natural language processing (NLP). It is the problem of extracting entities and relations among them from text documents. Examples of entities are people, organizations, and locations. Examples of relations are person-affiliation and organization-location. The person-affiliation relation means that a particular person is affiliated with a certain organization. For instance, the sentence “John Smith is the chief scientist of the Hardcom Corporation” contains the person-affiliation relation between the person “John Smith” and the organization “Hardcom Corporation”. In this paper, we address the problem of extracting such relations from natural language text.
- "We propose a machine learning approach to relation extraction. The patterns for identifying relations are learned from a set of already extracted relations rather than written manually. We also present a novel methodology for adaptive information extraction based on kernel methods (Vapnik, 1998, Cristianini and Shawe-Taylor, 2000). Kernel methods have enjoyed successful applications for other related problems such as text categorization (Joachims, 2002) and bioinformatics (Furey et al., 2000). Recently, kernel methods exhibited excellent performance for natural language parsing (Collins and Duffy, 2001).
- "We believe that shallow parsing (Abney, 1990) is an important prerequisite for information extraction. Shallow parsing provides a fairly robust mechanism for producing text representation that can be effectively used for entity and relation extraction.
- "We formalize a relation extraction problem as a shallow parse classification problem in Section 4. A shallow parse is turned into an example whose label reflects whether a relation of interest is expressed by the shallow parse. The learning system uses the labeled examples to output a model that is applied to shallow parses to obtain labels, and thus extract relations.
- "We note that our learning methodology differs from the prevalent approach to information extraction, viz., probabilistic modeling (Bikel et al., 1999, Miller et al., 1998). In contrast to probabilistic modeling, our approach is inherently discriminative. That is, we do not seek to explain the underlying text probabilistically. Instead, we learn a model whose sole purpose is to separate instances of a particular relation from non-instances thereof.
- "A unique property of the kernel methods is that we do not explicitly generate features. More precisely, an example is no longer a feature vector as it is common in machine learning algorithms. Instead, examples retain their original representations (of shallow parses) and are used within learning algorithms only via computing a similarity (or kernel) function between them. Such a use of examples allows our learning system to implicitly explore a much larger feature space than one computationally feasible for processing with feature-based learning algorithms.
- "We briefly survey a number of approaches currently used for such natural language tasks as part of speech tagging and entity extraction. Hidden Markov Models (HMM) (Rabiner, 1990) have been perhaps the most popular approach for adaptive information extraction. HMMs exhibited excellent performance for name extraction (Bikel et al., 1999). Recently, HMM (with various extensions) have been applied to extraction of slots (“speaker”, “time”, etc.) in seminar announcements (Freitag and McCallum, 2000). HMMs are mostly appropriate for modeling local and flat problems. Relation extraction often involves modeling long range dependencies, for which HMM methodology is not directly applicable. Several probabilistic frameworks for modeling sequential data have recently been introduced to alleviate for HMM restrictions. We note Maximum Entropy Markov Models (MEMM) (McCallum et al., 2000) and Conditional Random Fields (CRF) (Lafferty et al., 2001). MEMMs are able to model more complex transition and emission probability distributions and take into account various text features. CRFs are an example of exponential models (Berger et al., 1996); as such, they enjoy a number of attractive properties (e.g., global likelihood maximum) and are better suited for modeling sequential data, as contrasted with other conditional models (Lafferty et al., 2001). They are yet to be experimentally validated for information extraction problems.
3. Kernel-based Machine Learning
- "Most learning algorithms rely on feature-based representation of objects. That is, an object is transformed
into a collection features f1, . . . , fN, thereby producing a N-dimensional vector. - "In many cases, data cannot be easily expressed via features. For example, in most NLP problems,
feature based representations produce inherently local representations of objects, for it is
computationally infeasible to generate features involving long-range dependencies. - "Kernel methods (Vapnik, 1998, Cristianini and Shawe-Taylor, 2000) are an attractive alternative
to feature-based methods. Kernel methods retain the original representation of objects and use the
object in algorithms only via computing a kernel function between a pair of objects. A kernel
function is a similarity function satisfying certain properties. More precisely, a kernel function
K over the object space X is binary function K : X×X → [0,∞] mapping a pair of objects x,y ∈ X to their similarity score K(x,y). A kernel function is required to be [[Symmetric?]] and [[Positive_Semidefinite?]]. - "In can be shown that any kernel function implicitly calculates the dot-product of feature vectors
of objects in high-dimensional feature spaces. That is, there exist features f (·) = (f1(·), f2(·), . . .),
fi : X→R, so that K(x,y) = f (x), f (y) . - "Conversely, given features f (·) = (f1(·), f2(·), . . .), a function defined as a dot product of the
corresponding feature vectors is necessarily a kernel function. - "In many cases, it may be possible to compute the dot product of certain features without enumerating all the features. An excellent example is that of subsequence kernels (Lodhi et al., 2002). In this case, the objects are strings of characters, and the kernel function computes the number of common subsequences of characters in two strings, where each subsequence match is additionally decreased by the factor reflecting how spread out the matched subsequence in the original sequences (Lodhi et al., 2002). Despite the exponential number of features (subsequences), it is possible to compute the subsequence kernel in polytime. We therefore are able to take advantage of long-range features in strings without enumerating the features explicitly. In Section 5.2, we will extend the subsequence kernel to operate on shallow parses for relation extraction.
- "Another pertinent example is that of parse tree kernels(Collins and Duffy, 2001), where objects
represent trees and the kernel function computes the number of common subtrees in two trees. The
tree kernel used within the Voted Perceptron learning algorithm (Freund and Schapire, 1999) was
shown to deliver excellent performance in Penn Treebank parsing. - "We also note that both subsequence and subtree kernels belong to a class of convolution kernels
(Haussler, 1999). Convolution kernels allow to compute the similarity between two objects based
on the similarities of objects’ parts. Although the kernels that we introduce are not convolution
kernels per se, they are closely related thereto. - "There are a number of learning algorithms that can operate only using the dot product of examples. The models produced by the learning algorithms are also expressed using only examples’ dot products. Substituting a particular kernel functions in place of the dot product defines a specific instantiation of such learning algorithms. The algorithms that process examples only via computing their dot products are sometimes called dual learning algorithms.
- "The Support Vector Machine (SVM) (Cortes and Vapnik, 1995) is a learning algorithm that not only allows for a dual formulation, but also provides a rigorous rationale for resisting overfitting (Vapnik, 1998). Indeed, for the kernel-based algorithms working in extremely rich (though implicit) feature spaces, it is crucial to deal with the problem of overtraining. Both theoretical and experimental results indicate that SVM is able [to] generalize very well and avoid overfitting in high (and even infinite) dimensional features spaces. In Section 6 we experimentally evaluate the Support Vector Machine for relation extraction.
- "After discovery of the kernel methods, several existing learning algorithms were shown to have dual analogues. For instance, the Perceptron learning algorithm (Rosenblatt, 1962) can be easily represented in the dual form (Cristianini and Shawe-Taylor, 2000). A variance-reducing improvement of Perceptron, Voted Perceptron (Freund and Schapire, 1999), is a robust and efficient learning algorithm that is very easy to implement. It has been shown to exhibit performance comparable to that of SVM. We employ the algorithm for relation extraction in Section 6.
- "We note that, from the learning system design perspective, the kernel methods shift the focus
from the problem of feature selection to the problem of kernel construction. Since kernel is the only
domain specific component of a kernel learning system, it is critical to design a kernel that adequately
encapsulates information necessary for prediction. On the other hand, we hypothesize that
use of long range dependencies in kernel computation will allow the algorithm implicitly explorer a
much larger space than that available to feature-based algorithms. We next show how to formalize
relation extraction as a learning problem.
4. Problem Formalization
- Definition 1 A node p is a set of attributes {a1,a2, . . .}. Each node may have a different number of
attributes. The attributes are named, and each node necessarily has attributes with names “Type” and “Role”. - We use p.a to denote the value of attribute with the name a in the node p, e.g., p.Type = Person and p.Role = member.
- Definition 2 An (unlabeled) relation example is defined inductively as follows:
- Let p be a node, then the pair P = (p, []) is a relation example, where by [] we denote an empty sequence.
- Let p be a node, and [P1,P2, . . . ,Pl ] be a sequence of relation examples. Then, the pair P = (p, [P1,P2, . . . ,Pl ]) is a relation example.
- "We say that p is the parent of P1,P2, . . . ,Pl, and Pi’s are the children of p. We denote by P.p the first element of the example pair, by P.c the second element of the example pair, and use the shorthand P.a to refer to P.p.a, and P[i] to denote Pi. If unambiguous, we also use P.ai to denote the child Pi of P such that Pi.Type = ai. A labeled relation example is unlabeled relation example augmented with a label l ∈ {−1,+1}. An example is positive, if l =+1, and negative, otherwise. We now define kernels on relation examples that represent similarity of two shallow parse trees.
5. Kernels for Relation Extraction
- "Kernels on parse trees were previously defined by Collins and Duffy (2001). The kernels enumerated (implicitly) all subtrees of two parse trees, and used the number of common subtrees, weighted appropriately, as the measure of similarity between two parse trees. Since we are operating with shallow parse trees, and the focus of our problem is relation extraction rather than parsing, we use a different definition of kernels.
- "The nodes of the shallow parse trees have attributes, and we need to use the attributes in the kernel definition. We define a primitive kernel function on the nodes in terms of nodes’ attributes, and then extend it on relation examples.
- "We first define a matching function t(·, ·) ∈ {0,1} and a similarity function k(·, ·) on nodes. The matching function defined on nodes determines whether the nodes are matchable or not. In the case of relation extraction, the nodes are matchable only if their types and roles match. Thus, if two nodes have different roles, and non-compatible types,6 then their node matching function is equal to zero; otherwise, it is equal to 1. The similarity function on nodes is computed in terms of the nodes’ attributes.
5.1 Contiguous Subtree Kernels
- "For contiguous subtree kernels, the similarity function Kc enumerates only children contiguous subsequences, that is, for a subsequence i in (2), is+1 = is+1 and d(i) = l(i). Since then d(i) = d(j) as well, we slightly abuse notation in this section by making λ stand for λ2 in formula (2).
5.2 Sparse Subtree Kernels
- "For sparse subtree kernels, we use the general definition of similarity between children sequences
as expressed by (2) - "The core of the kernel computation resides in the formula (3). The formula enumerates all contiguous subsequences of two children sequences. We now give a fast algorithm for computing Kc between P1 and P2, which, given kernel values for children, runs in time O(mn), where m and n is the number of children of P1 and P2, respectively.
- As can be seen from the recurrences, the time complexity of the algorithm is O(mn^3) (assuming m ≥ n). The space complexity is O(mn^2).
6. Experiments
- "In this section, we apply kernel methods to extracting two types of relations from text: person-affiliation and organization-location.
7 Discussion
- "One practical problem in applying kernel methods to NLP is their speed. Kernel classifiers are
relatively slow compared to feature classifiers. Indeed, an application of a kernel classifier requires
evaluation of numerous kernels whose computational complexity may be too high for practical
purposes. Many low level problems in natural language processing involve very large corpora with
tens and hundreds of thousands of examples. Even if kernel classifiers only depend on a small
subset of the examples (for instance, support vectors of SVM), the need to evaluate thousands
of complex kernels during the classifier application may render kernel methods inappropriate for
various practical settings. Therefore, there is a pressing need to develop algorithms that combine
the advantages of kernel methods with practical constraints that require efficient application of the
classifiers learned. - "Design of kernels for structural domains is a very rich research area. An interesting direction to pursue would be to use extensive work on distances defined on structural objects (Sankoff and Kruskal, 1999) in kernel design. The distance-based methods have already found widespread application in bioinformatics (Durbin et al., 1998), and can be fruitfully extended to work in the NLP domain as well. Watkins (1999) presents sufficient conditions for a Pair Hidden Markov Model (which is a probabilistic version of edit distance) to constitute a kernel. More generally, the work of Goldfarb (1985) makes it possible to use any distance measure to embed objects (and define a dot product) in a pseudo-euclidean space. Incidentally, SVM can be adapted for the pseudo-euclidean representations (Graepel et al., 1999, Pekalska et al., 2001), hence, lending the power of regularization to learning in structural domains, where natural distance functions exist.
8. Conclusions and Further Work
- "We presented an approach for using kernel-based machine learning methods for extracting relations from natural language sources. We defined kernels over shallow parse representations of text and designed efficient dynamic programming algorithms for computing the kernels. We applied SVM and Voted Perceptron learning algorithms with the kernels incorporated therein to the tasks of relation extraction. We also compared performance of the kernel-based methods with that of the feature methods, and concluded that kernels lead to superior performance.
- "Aside from a number of interesting research directions mentioned in Section 7, we intend to apply the kernel methodology to other sub-problems of information extraction. For example, the shallow parsing and entity extraction mechanism may also be learned, and, perhaps, combined in a seamless fashion with the relation extraction formalism presented herein. Furthermore, the real-world use of extraction results requires discourse resolution that collapses entities, noun phrases, and pronouns into a set of equivalence classes. We plan to apply kernel methods for discourse processing as well.
References
- S. Abney. Parsing by chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principlebased
parsing. Kluwer Academic Publishers, 1990. - C. Aone, L. Halverson, T. Hampton, and M. Ramos-Santacruz. SRA: Description of the IE2 system used for MUC-7. In Proceedings of MUC-7, 1998.
- C. Aone and M. Ramos-Santacruz. REES: A large-scale relation and event extraction system. In Proceedings of the 6th Applied Natural Language Processing Conference, 2000.
- A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
- D. M. Bikel, R. Schwartz, and R. M. Weischedel. An algorithm that learns what’s in a name. Machine Learning, 34(1-3):211–231, 1999.
- M. Collins. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of 40th Conference of the Association for Computational Linguistics, 2002.
- M. Collins and N. Duffy. Convolution kernels for natural language. In Proceedings of NIPS-2001, 2001.
- C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods). Cambridge University Press, 2000.
- R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley, New York, 1973.
- R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Seqience Analysis. Cambridge University Press, 1998.
- D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. In Proceedings of the 7th Conference on Artificial Intelligence (AAAI-00) and of the 12th Conference on Innovative Applications of Artificial Intelligence (IAAI-00), pages 584–589, Menlo Park, CA, July 30– 3 2000. AAAI Press.
- Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
- T. Furey, N. Cristianini, N. Duffy, D. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression. Bioinformatics, 16, 2000.
- L. Goldfarb. A new approach to pattern recognition. In Progress in pattern recognition 2. North-
Holland, 1985. - T. Graepel, R. Herbrich, and K. Obermayer. Classification on pairwise proximity data. In Advances
in Neural Information Processing Systems 11, 1999. - [Haussler, 1999] => D. Haussler. (1999). Convolution Kernels on Discrete Structures. Technical Report UCSC-CLR-99-10, University of California at Santa Cruz.
- R. A. Horn and C. A. Johnson. Matrix Analysis. Cambridge University press, Cambridge, 1985.
- F. Jelinek. Statistical Methods for Speech Recognition. The MIT Press, Cambridge, Massachusetts, 1997.
- T. Joachims. Text categorization with support vector machines: learning with many relevant features. European Conf. Mach. Learning, ECML98, April 1998.
- T. Joachims. Learning Text Classifiers with Support Vector Machines. Kluwer Academic Publishers, Dordrecht, NL, 2002.
- J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282–289. Morgan Kaufmann, San Francisco, CA, 2001.
- N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285, 1987.
- [LodhiSSCW, 2002] => H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. (2002). Text classification using string kernels. The Journal of Machine Learning Research, vol:2.
- A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. 17th International Conf. on Machine Learning, pages 591–598. Morgan Kaufmann, San Francisco, CA, 2000.
- [MillerCFRSSW, 1998] => S. Miller, M. Crystal, H. Fox, L. Ramshaw, R. Schwartz, R. Stone, R. Weischedel, and the Annotation Group. (1998). Algorithms that learn to extract information BBN: Description of the SIFT system as used for MUC-7. In Proceedings of MUC-7.
- M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. A learning approach to shallow parsing. Technical
Report 2087, University of Illinois at Urbana-Champaign, Urbana, Illinois, 1999. - National Institute of Standars and Technology. Proceedings of the 6th Message Undertanding Conference (MUC-7), 1998.
- E. Pekalska, P. Paclik, and R. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2, 2001.
- L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1990.
- F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the theory of brain mechanisms. Spartan Books, Washington D.C., 1962.
- D. Roth. Learning in natural language. In Dean Thomas, editor, Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99-Vol2), pages 898–904, S.F., July 31–August 6 1999. Morgan Kaufmann Publishers.
- D. Roth and W. Yih. Relational learning via propositional algorithms: An information extraction case study. In Bernhard Nebel, editor, Proceedings of the seventeenth International Conference on Artificial Intelligence (IJCAI-01), pages 1257–1263, San Francisco, CA, August 4–10 2001. Morgan Kaufmann Publishers, Inc.
- D. Sankoff and J. Kruskal, editors. Time Warps, String Edits, and Macromolecules. CSLI Pulications, 1999.
- C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
- V. Vapnik. Statistical Learning Theory. John Wiley, 1998.
- [Watkins, 2000] => C. Watkins. (2000). http://www.svms.org/kernels/Watk99.pdf">Dynamic alignment kernels. In A.J. Smola, P.L. Bartlett, B. Schlkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers.