# 2001 ConditionalRandomFields

## Quotes

### Abstract

We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

### 1. Introduction

The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning & Schütze, 1999).

### 3. Conditional Random Fields

In what follows, X is a random variable over data sequences to be labeled, and Y is a random variable over corresponding label sequences. All components $Y_i$ of $Y$ are assumed to range over a finite label alphabet A. For example, $X$ might range over natural language sentences and $Y$ range over part-of-speech taggings of those sentences, with A the set of possible part-of-speech tags. The random variables $X$ and $Y$ are jointly distributed, but in a discriminative framework we construct a conditional model $p(Y\vert X)$ from paired observation and label sequences, and do not explicitly model the marginal $p(X)$.

Definition: Let $G = (V,E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X,Y)$ is a conditional random field in case, when conditioned on $X$, the random variables $Y_v$ obey the Markov property with respect to the graph: $p(Y_v \vert X,Y_w,w \neq v) = p(Y_v \vert X,Y_w, \lt math\gt w$ \sim v)[/itex], where w~v means that $w$ and $v$ are neighbors in $G$.

## References

• Steven P. Abney, Robert E. Schapire, & Singer, Y. (1999). Boosting applied to tagging and PP attachment. Proceedings of EMNLPVLC. New Brunswick, New Jersey: Association for Computational Linguistics.
• (Berger et al., 1996) ⇒ Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. (1996). “A Maximum Entropy Approach to Natural Language Processing.” In: Computational Linguistics, 22(1).
• Bottou, L. (1991). Une approche théorique de l’apprentissage connexionniste: Applications `a la reconnaissance de la parole. Doctoral dissertation, Université de Paris XI.
• E. Brill. (1995). Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics, 21, 543– 565.
• Michael Collins (2000). Discriminative reranking for natural language parsing. Proceedings of ICML (2000). Stanford, California.
• Michael Collins, Robert E. Schapire, & Singer, Y. (2000). Logistic regression, AdaBoost, and Bregman distances. Proceedings of 13th COLT. scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470–1480.
• Della Pietra, S., Della Pietra, V., & John D. Lafferty (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 380–393.
• Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.
• Dayne Freitag, & Andrew McCallum (2000). Information extraction with HMM structures learned by stochastic optimization. Proceedings of AAAI (2000).
• Yoav Freund, & Robert E. Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
• Hammersley, J., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.
• LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
• MacKay, D. J. (1996). Equivalence of linear Boltzmann chains and hidden Markov models. Neural Computation, 8, 178–181.
• Christopher D. Manning, & Hinrich Schütze (1999). Foundations of statistical natural language processing. Cambridge Massachusetts: MIT Press.
• Andrew McCallum, Dayne Freitag, & Fernando Pereira (2000). Maximum entropy Markov models for information extraction and segmentation. Proceedings of ICML 2000 (pp. 591–598). Stanford, California.
• Mohri, M. (1997). Finite-state transducers in language and speech processing. Computational Linguistics, 23.
• Mohri, M. (2000). Minimization algorithms for sequential transducers. Theoretical Computer Science, 234, 177– 201.
• Paz, A. (1971). Introduction to probabilistic automata. Academic Press.
• Punyakanok, V., & Dan Roth. (2001). The use of classifiers in sequential inference. NIPS 13. Forthcoming.
• Adwait Ratnaparkhi (1996). A maximum entropy model for part-of-speech tagging. Proceedings of EMNLP. New Brunswick, New Jersey: Association for Computational Linguistics.
• Rosenfeld, R. (1997). A whole sentence maximum entropy language model. Proceedings of the IEEE Workshop on Speech Recognition and Understanding. Santa Barbara, California.
• Dan Roth. (1998). Learning to resolve natural language ambiguities: A unified approach. Proceedings of 15th AAAI (pp. 806– 813). Menlo Park, California: AAAI Press.
• Saul, L., & Michael I. Jordan (1996). Boltzmann chains and hidden Markov models. Advances in Neural Information Processing Systems 7. MIT Press.
• Schwartz, R., & Austin, S. (1993). A comparison of several approximate algorithms for finding multiple (N-BEST) sentence hypotheses. Proceedings of ICASSP. Minneapolis, MN.

,

volumeDate ValuetitletypejournaltitleUrldoinoteyear
2001 ConditionalRandomFieldsConditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence DataICML 2001http://www.cis.upenn.edu/~pereira/papers/crf.pdf2001
 Author John D. Lafferty +, Andrew McCallum + and Fernando Pereira + journal Proceedings of the Eighteenth International Conference on Machine Learning + title Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data + titleUrl http://www.cis.upenn.edu/~pereira/papers/crf.pdf + year 2001 +