# Maximum Entropy Markov Model (MEMM)


A Maximum Entropy Markov Model (MEMM) is a discriminative Markov sequence model that …
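In the standard formulation (following McCallum et al., 2000, referenced below), each source state $s'$ has its own maximum-entropy (exponential) model over successor states, conditioned on the current observation $o$:

```latex
P(s \mid s', o) \;=\; \frac{1}{Z(o, s')}\,
  \exp\!\Big( \sum_{a} \lambda_a\, f_a(o, s) \Big)
```

where each $f_a(o, s)$ is a (typically binary) feature of the observation and the candidate state, $\lambda_a$ is its learned weight, and $Z(o, s')$ normalizes over the successor states of $s'$. Because normalization is local to each transition, states with few successors concentrate probability mass on them, which is the source of the label-bias problem noted in Lafferty et al. (2001).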

**AKA:** Conditional Markov Model (CMM).

**Context:**
- It can be instantiated as a … (Finite-State Sequence Tagging Model).
- It can be trained by a MEMM Training System (that implements a MEMM training algorithm).
- …

**Counter-Example(s):**

**See:** Markov Random Field, Logistic Regression Algorithm, Label-Bias Problem.

## References

### 2005

- (Jie Tang, 2005) ⇒ Jie Tang. (2005). “An Introduction for Conditional Random Fields.” Literature Survey – 2, Dec. 2005, Tsinghua University.

### 2003

- (Zelenko et al., 2003) ⇒ Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. (2003). “Kernel Methods for Relation Extraction.” In: Journal of Machine Learning Research, 3.
- QUOTE: MEMMs are able to model more complex transition and emission probability distributions and take into account various text features.

### 2001

- (Lafferty et al., 2001) ⇒ John D. Lafferty, Andrew McCallum, and Fernando Pereira. (2001). “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” In: Proceedings of ICML 2001.
- QUOTE: … avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states.

### 2000

- (McCallum et al., 2000a) ⇒ Andrew McCallum, Dayne Freitag, and Fernando Pereira. (2000). “Maximum Entropy Markov Models for Information Extraction and Segmentation.” In: Proceedings of ICML-2000.
- QUOTE: This paper presents a new Markovian sequence model, closely related to HMMs, that allows observations to be represented as arbitrary overlapping features (such as word, capitalization, formatting, part-of-speech), and defines the conditional probability of state sequences given observation sequences. It does this by using the maximum entropy framework to fit a set of exponential models that represent the probability of a state given an observation and the previous state. We present positive experimental results on the segmentation of FAQ’s.
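The per-state exponential models described in this abstract can be sketched as follows. This is a minimal illustration, not an implementation from the paper: the state set, feature templates, and weights below are invented for the example, and parameter estimation (e.g. GIS training) is omitted.

```python
import math

# Hypothetical tag set for the example (B/I/O chunk tags).
STATES = ["B", "I", "O"]

def features(obs, state):
    """Binary features of (observation, candidate next state).

    These overlapping feature templates (word identity, capitalization)
    are invented for the sketch; an MEMM permits arbitrary such features.
    """
    return {
        f"word={obs['word']},state={state}": 1.0,
        f"is_cap={obs['word'][0].isupper()},state={state}": 1.0,
    }

def transition_probs(weights, prev_state, obs):
    """P(next state | prev_state, obs) from prev_state's exponential model.

    Each source state owns its own weight vector, and normalization is
    local: the scores are renormalized only over the successor states.
    """
    lam = weights[prev_state]
    scores = {
        s: math.exp(sum(lam.get(f, 0.0) * v
                        for f, v in features(obs, s).items()))
        for s in STATES
    }
    z = sum(scores.values())  # local partition function Z(obs, prev_state)
    return {s: score / z for s, score in scores.items()}

# Toy weights: from state "O", a capitalized word favors moving to "B".
weights = {s: {} for s in STATES}
weights["O"]["is_cap=True,state=B"] = 2.0

probs = transition_probs(weights, "O", {"word": "Paris"})
```

Decoding a full sequence would chain these local distributions with Viterbi search, exactly as with an HMM, but conditioning each transition on the observation.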