2000 MaxEntMarkModForIEandSeg

Jump to: navigation, search

Subject Headings: Maximum Entropy Markov Model, Part-of-Speech Tagging Task, Text Segmentation Task.


Cited By

  • Cited ~586 times …



Hidden Markov models (HMMs) are a powerful probabilistic tool for modeling sequential data, and have been applied with success to many text-related tasks, such as part-of-speech tagging, text segmentation and information extraction. In these cases, the observations are usually modeled as multinomial distributions over a discrete vocabulary, and the HMM parameters are set to maximize the likelihood of the observations. This paper presents a new Markovian sequence model, closely related to HMMs, that allows observations to be represented as arbitrary overlapping features (such as word, capitalization, formatting, part-of-speech), and defines the conditional probability of state sequences given observation sequences. It does this by using the maximum entropy framework to fit a set of exponential models that represent the probability of a state given an observation and the previous state. We present positive experimental results on the segmentation of FAQ’s."

1. Introduction

The large volume of text available on the Internet is causing an increasing interest in algorithms that can automatically process and mine information from this text. Hidden Markov models (HMMs) are a powerful tool for representing sequential data, and have been applied with significant success to many text-related tasks, including part-of-speech tagging (Kupiec, 1992), text segmentation and event tracking (Yamron, Carp, Gillick, Lowe, & van Mulbregt, 1998), named entity recognition (Bikel, Schwartz, & Weischedel, 1999) and information extraction (Leek, 1997; Freitag & McCallum, 1999).

HMMs are probabilistic finite state models with parameters for state-transition probabilities and state-specific observation probabilities. Greatly contributing to their popularity is the availability of straightforward procedures for training by maximum likelihood (Baum-Welch) and for using the trained models to find the most likely hidden state sequence corresponding to an observation sequence (Viterbi). In text-related tasks, the observation probabilities are typically represented as a multinomial distribution over a discrete, finite vocabulary of words, and Baum-Welch training is used to learn parameters that maximize the probability of the observation sequences in the training data.

This paper introduces maximum entropy Markov models (MEMMs), which address both of these concerns. To allow for non-independent, difficult to enumerate observation features, we move away from the generative, joint probability parameterization of HMMs to a conditional model that represents the probability of reaching a state given an observation and the previous state. These conditional probabilities are specified by exponential models based on arbitrary observation features. The exponential models follow from a maximum entropy argument, and are trained by generalized iterative scaling (GIS) (Darroch & Ratcliff, 1972), which is similar in form and computational cost to the expectation-maximization (EM) algorithm (Dempster, Laird, &Rubin, 1977). The “three classic problems” (Rabiner, 1989) of HMMs can all be straightforwardly solved in this new model with new variants of the forward-backward, Viterbi and Baum-Welch algorithms.

4. Related Work

However, we know of no previous general method that combines the rich state representation of Markov models with the flexible feature combination of exponential models. The MENE named-entity recognizer (Borthwick, Sterling, Agichtein, & Grishman, 1998) uses an exponential model to label each word with a label indicating the position of the word in a labeled-entity class (start, inside, end or singleton), but the conditioning information does not include the previous label, unlike our model. Therefore, it is closer to our ME-Stateless model. It is possible that its inferior performance compared to an HMM-based named-entity recognizer (Bikel et al., 1999) may have similar causes to the corresponding weakness of ME-Stateless relative to FeatureHMM in our experiments — the lack of representation of sequential dependencies.


  • Argamon, S., Dagan, I., & Krymolowski, Y. (1998). A memory-based approach to learning shallow natural language patterns. In COLING-ACL 98, pp. 67–73 New Brunswick, New Jersey. Association for Computational Linguistics.
  • (Beeferman et al., 1999) ⇒ Doug Beeferman, Adam Berger, and John D. Lafferty. (1999). “Statistical Models for Text Segmentation.” In: Machine Learning, 34(1–3).
  • Bikel, D. M., Schwartz, R. L., & Weischedel, R. M. (1999). An algorithm that learns what’s in a name. Machine Learning Journal, 34, 211–231. Borthwick, A., Sterling, J., Eugene Agichtein, & Grishman, R. (1998).
  • (Borthwick et al., 1998) ⇒ Andrew Borthwick, John Sterling, Eugene Agichtein, and Ralph Grishman. (1998). “Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition.” In: Proceedings of the Sixth Workshop on Very Large Corpora.
  • Brill, E. (1995). Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics, 21(4), 543–565.
  • Burke, R., Hammond, K., Kulyukin, V., Lytinen, S., & Tomuro, N. (1997). Question answering from frequently-asked question files: Experiences with the FAQ Finder system. AI Magazine, 18, 57–66.
  • Chen, S., & Rosenfeld, R. (1999). Efficient sampling and feature selection in whole sentence maximum entropy language models. In: Proceedings of ICASSP’99. IEEE.
  • Darroch, J. N., & Ratcliff, D. (1972). Generlized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), 1470–1480.
  • Della Pietra, S., Della Pietra, V., & John D. Lafferty (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4).
  • Arthur P. Dempster, Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
  • Dayne Freitag, & Andrew McCallum K. (1999). Information extraction using hmms and shrinkage. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extration, pp. 31–36 Menlo Park, California. AAAI.
  • Ghahramani, Z., & Michael I. Jordan (1996). Factorial hidden Markov models. In Mozer, M., Touretzky, D., & Perrone, M. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press.
  • Kanazawa, K., Koller, D., & Russell, S. (1995). Stochastic simulation algorithms for dynamic probabilistic networks. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence Montreal, Canada. Morgan Kaufmann.
  • Kupiec, J. (1992). Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6, 225– 242.
  • Leek, T. R. (1997). Information extraction using hidden Markov models. Master’s thesis, UC San Diego. Paz, A. (1971). Introduction to Probabilistic Automata. Academic Press.
  • Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–285.
  • Adwait Ratnaparkhi (1998). Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.
  • Rosenfeld, R. (1994). Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. thesis, Carnegie Mellon University.
  • Dan Roth. (1998). Learning to resolve natural language ambiguities: a unified approach. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 806– 813 Menlo Park, California. AAAI Press.
  • Saul, L.,& Rahim, M. (1999). Markov processes on curves for automatic speech recognition. In Kearns, M. S., Solla, S. A., & Cohn, D. A. (Eds.), Advances in Neural Information Processing Systems, Vol. 11 Cambridge, Massachusetts. MIT Press.
  • Yamron, J., Carp, I., Gillick, L., Lowe, S., & van Mulbregt, P. (1998). A hidden Markov model approach to text segmentation and event tracking. In: Proceedings of ICASSP’98. IEEE.


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2000 MaxEntMarkModForIEandSegAndrew McCallum
Dayne Freitag
Fernando Pereira
Maximum Entropy Markov Models for Information Extraction and SegmentationProceedings of ICML-2000http://www.cs.umass.edu/~mccallum/papers/memm-icml2000.ps2000