# 2002 DiscriminativeTrainingMethodsForHMM

- (Collins, 2002b) ⇒ Michael Collins. (2002). “Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm.” In: Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing, (EMNLP 2002). doi:10.3115/1118693.1118694

**Subject Headings:** Voted Perceptron Model, Part-of-Speech Tagging Task, Base Noun Phrase Chunking Task, Discriminative Training Algorithm.

## Notes

- This is a companion paper to (Collins, 2002a)

## Cited By

- ~558 …

### 2006

- (Richardson & Domingos, 2006) ⇒ Matthew Richardson, and Pedro Domingos. (2006). “Markov Logic Networks.” In: Machine Learning, 62. doi:10.1007/s10994-006-5833-1.

### 2005

- (Collins & Koo, 2005) ⇒ Michael Collins, and Terry Koo. (2005). “Discriminative Reranking for Natural Language Parsing.” In: Computational Linguistics, 31(1). doi:10.1162/0891201053630273

### 2003

- (Sha & Pereira, 2003a) ⇒ Fei Sha, and Fernando Pereira. (2003). “Shallow Parsing with Conditional Random Fields.” In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL 2003). doi:10.3115/1073445.1073473

## Quotes

### Abstract

We describe new algorithms for training tagging models, as an alternative to maximum-entropy models or conditional random fields (CRFs). The algorithms rely on Viterbi decoding of training examples, combined with simple additive updates. We describe theory justifying the algorithms through a modification of the proof of convergence of the perceptron algorithm for classification problems. We give experimental results on part-of-speech tagging and base noun phrase chunking, in both cases showing improvements over results for a maximum-entropy tagger.

…

### 2 Parameter Estimation

### 2.1 HMM Taggers

... As an alternative to maximum-likelihood parameter estimates, this paper will propose the following estimation algorithm. Say the training set consists of [math]\displaystyle{ n }[/math] tagged sentences, the *i*^{th} sentence being of length [math]\displaystyle{ n_i }[/math].
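The estimation algorithm described above (and elided here) is the perceptron update driven by Viterbi decoding that the abstract summarizes. A minimal sketch, assuming hypothetical `global_phi` and `viterbi_decode` helpers that are not part of the paper's own code:

```python
from collections import defaultdict

def train_perceptron(sentences, tag_seqs, global_phi, viterbi_decode, epochs=5):
    """Perceptron-style estimation sketch: decode each training sentence
    with the current weights and, on a mistake, apply a simple additive
    update toward the gold tag sequence.
    `global_phi(words, tags)` -> dict of feature counts (assumed helper).
    `viterbi_decode(words, w)` -> best tag sequence under weights `w`."""
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in zip(sentences, tag_seqs):
            pred = viterbi_decode(words, w)
            if pred != gold:
                for f, c in global_phi(words, gold).items():
                    w[f] += c  # promote features of the correct tagging
                for f, c in global_phi(words, pred).items():
                    w[f] -= c  # demote features of the predicted tagging
    return w
```

With indicator features, each update simply increments counts seen in the gold tagging and decrements counts seen in the (incorrect) Viterbi output.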

... Maximum-entropy models represent the tagging task through a feature-vector representation of history–tag pairs. A feature-vector representation [math]\displaystyle{ \phi : \mathcal{H} \times \mathcal{T} \rightarrow \mathbb{R}^d }[/math] is a function that maps a history–tag pair to a [math]\displaystyle{ d }[/math]-dimensional feature vector. Each component [math]\displaystyle{ \phi_s(h, t) }[/math] for [math]\displaystyle{ s = 1 \ldots d }[/math] could be an arbitrary function of [math]\displaystyle{ (h, t) }[/math]. It is common (e.g., see (Ratnaparkhi, 1996)) for each feature [math]\displaystyle{ \phi_s }[/math] to be an indicator function. For example, one such feature might be

[math]\displaystyle{ \phi_{1000}(h, t) = \begin{cases} 1 & \text{if current word } w_i \text{ is “the” and } t = \text{DT} \\ 0 & \text{otherwise} \end{cases} }[/math]
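Such an indicator feature is straightforward to write down directly. A minimal sketch, where the index 1000 and the convention that a history exposes its current word as `history.word` are illustrative assumptions:

```python
def phi_1000(history, tag):
    """Indicator feature: fires (returns 1) when the current word is
    "the" and the proposed tag is DT, and returns 0 otherwise.
    `history.word` is an illustrative convention for the current word."""
    return 1 if history.word == "the" and tag == "DT" else 0
```

In practice a tagger uses many thousands of such indicator features, one per observed word–tag (or context–tag) configuration.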

### 2.2 Local and Global Feature Vectors

We now describe how to generalize the algorithm to more general representations of tagged sequences. In this section we describe the feature-vector representations which are commonly used in maximum-entropy models for tagging, and which are also used in this paper.
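In this style of model, the global feature vector for a whole tagged sentence is the sum of the local feature vectors over tagging decisions. A hedged sketch, where representing the local history as `(words, i, previous tags)` is an illustrative convention rather than the paper's exact notation:

```python
from collections import Counter

def global_phi(words, tags, local_phi):
    """Global feature vector as the sum of local feature vectors:
    Phi(words, tags) = sum over positions i of phi(h_i, t_i).
    `local_phi(words, i, prev_tags, t)` -> dict of local feature counts
    (an assumed helper)."""
    total = Counter()
    for i in range(len(tags)):
        total.update(local_phi(words, i, tags[:i], tags[i]))
    return total
```

Because the global representation decomposes position by position, a Viterbi search over tag sequences can score candidates incrementally.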

…


2002 DiscriminativeTrainingMethodsForHMM

Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Michael Collins | | | Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm | | Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing | http://www.ai.mit.edu/people/mcollins/papers/tagperc.pdf | 10.3115/1118693.1118694 | | 2002