# 2007 IntroToConditionalRandomFields

(Redirected from Sutton & McCallum, 2007)

Subject Headings: Conditional Random Fields, Multi-Relational Data Mining

## Quotes

### Abstract

Conditional random fields (CRFs) combine the modeling flexibility of graphical models with the ability to use rich, nonindependent features of the input. In this tutorial, we review modeling, reference, and parameter estimation in CRFs, both on linear chains and on general graphical structures. We discuss differences between generative and discriminative modeling, latent-variable conditional models, and practical aspects of CRF implementations. Finally, we present a case-study applying a loopy CRF to a relational problem in natural language processing.

### 1.1 Introduction

Relational data has two characteristics: first, statistical dependencies exist between the entities we wish to model, and second, each entity often has a rich set of features that can aid classification. For example, when classifying Web documents, the page’s text provides much information about the class label, but hyperlinks define a relationship between pages that can improve classification [Taskar et al., 2002]. Graphical models are a natural formalism for exploiting the dependence structure among entities. Traditionally, graphical models have been used to represent the joint probability distribution p(y, x), where the variables y represent the attributes of the entities that we wish to predict, and the input variables x represent our observed knowledge about the entities. But modeling the joint distribution can lead to difficulties when using the rich local features that can occur in relational data, because it requires modeling the distribution p(x), which can include complex dependencies. Modeling these dependencies among inputs can lead to intractable models, but ignoring them can lead to reduced performance.

A solution to this problem is to directly model the conditional distribution $p(y \vert x)$, which is sufficient for classification. This is the approach taken by conditional random fields [Lafferty et al., 2001]. A conditional random field is simply a conditional distribution $p(\mathbf{y} \vert \mathbf{x})$ with an associated graphical structure. Because the model is conditional, dependencies among the input variables x do not need to be explicitly represented, affording the use of rich, global features of the input. For example, in natural language tasks, useful features include neighboring words and word bigrams, prefixes and suffixes, capitalization, membership in domain-specific lexicons, and semantic information from sources such as WordNet. Recently there has been an explosion of interest in CRFs, with successful applications including text processing [Taskar et al., 2002, Peng and McCallum, 2004, Settles, 2005, Sha and Pereira, 2003], bioinformatics [Sato and Sakakibara, 2005, Liu et al., 2005], and computer vision [He et al., 2004, Kumar and Hebert, 2003].

This chapter is divided into two parts. First, we present a tutorial on current training and inference techniques for conditional random fields. We discuss the important special case of linear-chain CRFs, and then we generalize these to arbitrary graphical structures. We include a brief discussion of techniques for practical CRF implementations.

Second, we present an example of applying a general CRF to a practical relational learning problem. In particular, we discuss the problem of information extraction, that is, automatically building a relational database from information contained in unstructured text. Unlike linear-chain models, general CRFs can capture long distance dependencies between labels. For example, if the same name is mentioned more than once in a document, all mentions probably have the same label, and it is useful to extract them all, because each mention may contain different complementary information about the underlying entity. To represent these long-distance dependencies, we propose a skip-chain CRF, a model that jointly performs segmentation and collective labeling of extracted mentions. On a standard problem of extracting speaker names from seminar announcements, the skip-chain CRF has better performance than a linear-chain CRF.

### 1.2.1 Definitions

We consider probability distributions over sets of random variables $V$ = $X$$Y$, where $X$ is a set of input variables that we assume are observed, and $Y$ is a set of output variables that we wish to predict. Every variable $v$$V$ takes outcomes from a set V, which can be either continuous or discrete, although we discuss only the discrete case in this chapter. We denote an assignment to $X$ by $\mathbf{x}$, and we denote an assignment to a set AX by xA, and similarly for Y. We use the notation 1{x=x0} to denote an indicator function of $x$ which takes the value 1 when $x$ = x and 0 otherwise.

A graphical model is a family of probability distributions that factorize according to an underlying graph. The main idea is to represent a distribution over a large number of random variables by a product of local functions that each depend on only a small number of variables. Given a collection of subsets AV, we define an undirected graphical model as the set of all distributions that can be written in the form

for any choice of factors $F$ = {ΨA}, where ΨA://VnR+. (These functions are also called local functions or compatibility functions.) We will occasionally use the term random field to refer to a particular distribution among those defined by an undirected model. To reiterate, we will consistently use the term model to refer to a family of distributions, and random field (or more commonly, distribution) to refer to a single one.

#### 1.2.2 Applications of graphical models

In this section we discuss a few applications of graphical models to natural language processing. Although these examples are well-known, they serve both to clarify the definitions in the previous section, and to illustrate some ideas that will arise again in our discussion of conditional random fields. We devote special attention to the hidden Markov model (HMM), because it is closely related to the linear-chain CRF.

### 1.2.2.1 Classification

First we discuss the problem of classification, that is, predicting a single class variable y given a vector of features x = (x1, x2, . . ., xK). One simple way to accomplish this is to assume that once the class label is known, all the features are independent. The resulting classifier is called the naive Bayes classifier. It is based on a joint probability model of the form:

$p(y,x) = p(y) KY k=1 p(x_k \vert y)$. (1.5)

This model can be described by the directed model shown in Figure 1.1 (left). We can also write this model as a factor graph, by defining a factor (y) = p(y), and a factor $k(y,x_k) = p(x_k \vert \vert y)$ for each feature $x_k$. This factor graph is shown in Figure 1.1 (right).

Another well-known classifier that is naturally represented as a graphical model is logistic regression (sometimes known as the maximum entropy classifier in the NLP community). In statistics, this classifier is motivated by the assumption that the log probability, $\log p(y \vert x)$, of each class is a linear function of x, plus a normalization constant. This leads to the conditional distribution:

$p(y \vert x)$ = 1 Z(x) exp 8< : y + XK j=1 y,jxj 9= ;, (1.6)

where Z(x) = P y exp{ y + PK j=1 y,jxj} is a normalizing constant, and y is a bias weight that acts like log p(y) in naive Bayes. Rather than using one vector per class, as in (1.6), we can use a different notation in which a single set of weights is shared across all the classes. The trick is to define a set of feature functions that are nonzero only for a single class. To do this, the feature functions can be defined as fy0,j(y, x) = 1{y0=y}xj for the feature weights and fy0 (y, x) = 1{y0=y} for the bias weights. Now we can use fk to index each feature function fy0,j, and k to index its corresponding weight y0,j . Using this notational trick, the logistic regression model becomes:

$p(y \vert x)$ = 1 Z(x) exp (XK k=1 kfk(y, x)). (1.7)

We introduce this notation because it mirrors the usual notation for conditional random fields.

### 1.2.2.2 Sequence Models

Classifiers predict only a single class variable, but the true power of graphical models lies in their ability to model many variables that are interdependent. In this section, we discuss perhaps the simplest form of dependency, in which the output variables are arranged in a sequence. To motivate this kind of model, we discuss an application from natural language processing, the task of named-entity recognition (NER). NER is the problem of identifying and classifying proper names in text, including locations, such as China ; people, such as George Bush ; and organizations, such as the United Nations. The named-entity recognition task is, given a sentence, first to segment which words are part of entities, and then to classify each entity by type (person, organization, location, and so on). The challenge of this problem is that many named entities are too rare to appear even in a large training set, and therefore the system must identify them based only on context.

### 1.2.3 Discriminative and Generative Models

An important difference between naive Bayes and logistic regression is that naive Bayes is generative, meaning that it is based on a model of the joint distribution $p(y,x)$, while logistic regression is discriminative, meaning that it is based on a model of the conditional distribution $p(y,x)$. In this section, we discuss the differences between generative and discriminative modeling, and the advantages of discriminative modeling for many tasks. For concreteness, we focus on the examples of naive Bayes and logistic regression, but the discussion in this section actually applies in general to the differences between generative models and conditional random fields.

The main difference is that a conditional distribution $p(\mathbf{y} \vert \mathbf{x}) p(y \vert x)$ does not include a model of p(x), which is not needed for classification anyway. The difficulty in modeling $p(x)$ is that it often contains many highly dependent features, which are difficult to model. For example, in named-entity recognition, an HMM relies on only one feature, the word’s identity. But many words, especially proper names, will not have occurred in the training set, so the word-identity feature is uninformative. To label unseen words, we would like to exploit other features of a word, such as its capitalization, its neighboring words, its prefixes and suffixes, its membership in predetermined lists of people and locations, and so on …

Figure 1.2 Diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative models, and general CRFs.

This means that if the naive Bayes model (1.5) is trained to maximize the conditional likelihood, we recover the same classifier as from logistic regression. Conversely, if the logistic regression model is interpreted generatively, as in (1.9), and is trained to maximize the joint likelihood p(y, x), then we recover the same classifier as from naive Bayes. In the terminology of Ng and Jordan [2002], naive Bayes and logistic regression form a generative-discriminative pair. The principal advantage of discriminative modeling is that it is better suited to including rich, overlapping features.

### 1.3 Linear-Chain Conditional Random Fields

In the previous section, we have seen advantages both to discriminative modeling and sequence modeling. So it makes sense to combine the two. This yields a linear-chain CRF, which we describe in this section. First, in Section 1.3.1, we define linear-chain CRFs, motivating them from HMMs. Then, we discuss parameter estimation (Section 1.3.2) and inference (Section 1.3.3) in linear-chain CRFs.

#### 1.3.3 Inference

A final inference task that is useful in some applications is to compute a marginal probability $p(y_t, y_{t+1}, … y_{t+k} \vert \mathbf{x})$ over a range of nodes. For example, this is useful for measuring the model’s confidence in its predicted labeling over a segment of input. This marginal probability can be computed efficiently using constrained forward-backward, as described by Culotta and McCallum [2004].

### 1.4 CRFs in General

In this section, we define CRFs with general graphical structure, as they were introduced originally [Lafferty et al., 2001]. Although initial applications of CRFs used linear chains, there have been many later applications of CRFs with more general graphical structures. Such structures are especially useful for relational learning, because they allow relaxing the iid assumption among entities. Also, although CRFs have typically been used for across-network classification, in which the training and testing data are assumed to be independent, we will see that CRFs can be used for within-network classification as well, in which we model probabilistic dependencies between the training data and testing data.

The generalization from linear-chain CRFs to general CRFs is fairly straight-forward. We simply move from using a linear-chain factor graph to a more general factor graph, and from forward-backward to more general (perhaps approximate) inference algorithms.

#### 1.4.1 Model

First we present the general definition of a conditional random field.

Definition 1.2

Let $G$ be a factor graph over $Y$ . Then $p(\mathbf{y} \vert \mathbf{x})$ is a conditional random field if for any fixed $\mathbf{x}$, the distribution $p(\mathbf{y} \vert \mathbf{x})$ factorizes according to $G$.

Thus, every conditional distribution $p(y \vert x)$ is a CRF for some, perhaps trivial, factor graph. If $F$ = {ΨA} is the set of factors in $G$, and each factor takes the exponential family form (1.3), then the conditional distribution can be written as:

In addition, practical models rely extensively on parameter tying. For example, in the linear-chain case, often the same weights are used for the factors t(yt, yt−1, xt) at each time step. To denote this, we partition the factors of G into C = {C1,C2, . . .CP }, where each Cp is a clique template whose parameters are tied. This notion of clique template generalizes that in Taskar et al. [2002], Sutton et al. [2004], and Richardson and Domingos [2005]. Each clique template Cp is a set of factors which has a corresponding set of sufficient statistics {fpk(xp, yp)} and parameters p 2 <K(p). Then the CRF can be written as

where each factor is parameterized as and the normalization function is …

For example, in a linear-chain conditional random field, typically one clique template $C$ = {t(yt, yt−1, xt)}Tt=1 is used for the entire network.

Several special cases of conditional random fields are of particular interest. First, dynamic conditional random fields [Sutton et al., 2004] are sequence models which allow multiple labels at each time step, rather than single labels as in linear-chain CRFs. Second, relational Markov networks [Taskar et al., 2002] are a type of general CRF in which the graphical structure and parameter tying are determined by an SQL-like syntax. Finally, Markov logic networks (Richardson and Domingos, 2005, Singla and Domingos, 2005) are a type of probabilistic logic in which there are parameters for each first-order rule in a knowledge base.

## References

• Pieter Abbeel, Daphne Koller, and Andrew Y. Ng. Learning factor graphs in polynomial time and sample complexity. In Twenty-first Conference on Uncertainty in Artificial Intelligence (UAI05), 2005.
• Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden Markov support vector machines. In 20th International Conference on Machine Learning (ICML), 2003.
• Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.
• Julian Besag. E ciency of pseudolikelihood estimation for simple gaussian fields. Biometrika, 64(3):616–618, 1977. Razvan C. Bunescu and Raymond Mooney. Collective information extraction with relational Markov networks. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004. Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. Representations of quasi- Newton matrices and their use in limited memory methods. Math. Program., 63(2):129–156, 1994.
• Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms using different performance metrics. Technical Report TR2005-1973, Cornell University, (2005). Available at http://www.cs.cornell.edu/ alexn/.
• Yejin Choi, Claire Cardie, Ellen Rilo, and Siddharth Patwardhan. Identifying sources of opinions with conditional random fields and extraction patterns. In: Proceedings of the Human Language Technology Conference/Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP), 2005.
• Stephen Clark and James R. Curran. Parsing the WSJ using CCG and log-linear models. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 103–110, Barcelona, Spain, July 2004.
• Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002.
• Philip J. Cowans and Martin Szummer. A graphical model for simultaneous partitioning and labeling. In Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
• Aron Culotta, Ron Bekkerman, and Andrew McCallum. Extracting social networks and contact information from email and the web. In First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, 2004.
• Aron Culotta and Andrew McCallum. Confidence estimation for information extraction. In Human Language Technology Conference (HLT), 2004.
• Jenny Finkel, Trond Grenager, and Christopher D. Manning. Incorporating nonlocal information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005.
• Dayne Freitag. Machine Learning for Information Extraction in Informal Domains. PhD thesis, Carnegie Mellon University, 1998.
• Nadia Ghamrawi and Andrew McCallum. Collective multi-label classification. In Conference on Information and Knowledge Management (CIKM), 2005.
• Joshua Goodman. Exponential priors for maximum entropy models. In: Proceedings of the Human Language Technology Conference/North American Chapter of the Association for Computational Linguistics (HLT/NAACL), 2004.
• Xuming He, Richard S. Zemel, and Miguel ´A. Carreira-Perpi˜ni´an. Multiscale conditional random fields for image labelling. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
• G.E. Hinton. Training products of experts by minimizing contrastive divergence. Technical Report 2000-004, Gatsby Computational Neuroscience Unit, 2000.
• F. R. Kschischang, B. J. Frey, and H. A. Loeliger. Factor graphs and the sumproduct algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
• Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2004.
• Sanjiv Kumar and Martial Hebert. Discriminative fields for modeling spatial dependencies in natural images. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2003.
• John D. Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of 18th International Conference on Machine Learning, 2001.
• Yan Liu, Jaime Carbonell, Peter Weigele, and Vanathi Gopalakrishnan. Segmentation conditional random fields (SCRFs): A new approach for protein fold recognition. In ACM International conference on Research in Computational Molecular Biology (RECOMB05), 2005.
• R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Dan Roth and Antal van den Bosch, editors, Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49–55, 2002.
• David McAllester, Michael Collins, and Fernando Pereira. Case-factor diagrams for structured probabilistic modeling. In Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
• Andrew McCallum. Effiently inducing features of conditional random fields. In Conference on Uncertainty in AI (UAI), 2003.
• Andrew McCallum, Kedar Bellare, and Fernando Pereira. A conditional random field for discriminatively-trained finite-state string edit distance. In Conference on Uncertainty in AI (UAI), 2005.
• Andrew McCallum and David Jensen. A note on the unification of information extraction and data mining using conditional-probability, relational models. In IJCAI’03 Workshop on Learning Statistical Models from Relational Data, 2003.
• Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Seventh Conference on Natural Language Learning (CoNLL), 2003.
• Andrew McCallum and Ben Wellner. Conditional models of identity uncertainty with application to noun coreference. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 905–912. MIT Press, Cambridge, MA, 2005.
• Thomas P. Minka. A comparsion of numerical optimizers for logistic regression. Technical report, (2003). http://research.microsoft.com/minka/papers/logreg/.
• Tom. Minka. Discriminative models, not discriminative training. Technical Report MSR-TR-2005-144, Microsoft Research, October 2005. ftp://ftp.research.microsoft.com/ pub/tr/TR-2005-144.pdf .
• A. Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Thomas G. Dietterich, S. Becker, and Zoubin Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 841–848, Cambridge, MA, (2002). MIT Press.
• Fuchun Peng, Fangfang Feng, and Andrew McCallum. Chinese segmentation and new word detection using conditional random fields. In: Proceedings of The 20th International Conference on Computational Linguistics (COLING), pages 562– 568, 2004.
• Fuchun Peng and Andrew McCallum. Accurate information extraction from research papers using conditional random fields. In: Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL’04), 2004.
• Leonid Peshkin and Avi Pfeffer. Bayesian information extraction network. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.
• Yuan Qi, Martin Szummer, and Thomas P. Minka. Diagram structure recognition by Bayesian conditional random fields. In: Proceedings of The International Conference on Computer Vision and Pattern Recognition, 2005.
• Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional random fields for object recognition. In Lawrence K. Saul, YairWeiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1097–1104. MIT Press, Cambridge, MA, 2005.
• Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257 – 286, 1989. Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, (2005). To appear.
• S. Riezler, T. King, R. Kaplan, R. Crouch, J. Maxwell, and M. Johnson. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
• Dan Roth and W. Yih. Integer linear programming inference for conditional random fields. In: Proceedings of the International Conference on Machine Learning (ICML), pages 737–744, 2005.
• Sunita Sarawagi, and William W. Cohen. Semi-Markov conditional random fields for information extraction. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1185–1192. MIT Press, Cambridge, MA, 2005.
• Kengo Sato and Yasubumi Sakakibara. RNA secondary structural alignment with conditional random fields. Bioinformatics, 21:ii237–242, 2005.
• Burr Settles. Abner: an open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191–3192, 2005.
• Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, pages 213–220, 2003.
• P. Singla and Pedro Domingos. Discriminative training of Markov logic networks. In Proceedings of the Twentieth National Conference on Artificial Intelligence, pages 868–873, Pittsburgh, PA, (2005). AAAI Press.
• Charles Sutton. Conditional probabilistic context-free grammars. Master’s thesis, University of Massachusetts, (2004). http://www.cs.umass.edu/casutton/publications.html.
• Charles Sutton and Andrew McCallum. Piecewise training of undirected models. In 21st Conference on Uncertainty in Artificial Intelligence, 2005.
• Charles Sutton, Khashayar Rohanimanesh, and Andrew McCallum. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 2004.
• Ben Taskar, Pieter Abbeel, and Daphne Koller. Discriminative probabilistic models for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), 2002. Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
• Paul Viola and Mukund Narasimhan. Learning to extract information from semistructured text using a discriminative context free grammar. In: Proceedings of the ACM SIGIR, 2005.
• M. Wainwright, T. Jaakkola, and A. Willsky. Tree-based reparameterization for approximate estimation on graphs with cycles. Advances in Neural Information Processing Systems (NIPS), 2001.
• M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-reweighted belief propagation and approximate ML estimation by pseudo-moment matching. In Ninth Workshop on Artificial Intelligence and Statistics, 2003.
• Hanna Wallach. E cient training of conditional random fields. M.Sc. thesis, University of Edinburgh, 2002.
• Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated, conditional model of information extraction and coreference with application to citation graph construction. In 20th Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
• J.S. Yedidia, W.T. Freeman, and Yair Weiss. Constructing free energy approximations and generalized belief propagation algorithms. Technical Report TR2004-040, Mitsubishi Electric Research Laboratories, 2004.,

volumeDate ValuetitletypejournaltitleUrldoinoteyear
2007 IntroToConditionalRandomFieldsAn Introduction to Conditional Random Fields for Relational Learninghttp://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf2007
 Author Charles Sutton + and Andrew McCallum + title An Introduction to Conditional Random Fields for Relational Learning + titleUrl http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf + year 2007 +