2002 DiscriminativeTrainingMethodsForHMM
- (Collins, 2002b) ⇒ Michael Collins. (2002). “Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm.” In: Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing, (EMNLP 2002). doi:10.3115/1118693.1118694
Subject Headings: Voted Perceptron Model, Part-of-Speech Tagging Task, Base Noun Phrase Chunking Task, Discriminative Training Algorithm.
Notes
- This is a companion paper to (Collins, 2002a).
Cited By
- ~558 …
2006
- (Richardson & Domingos, 2006) ⇒ Matthew Richardson, and Pedro Domingos. (2006). “Markov Logic Networks.” In: Machine Learning, 62. doi:10.1007/s10994-006-5833-1.
2005
- (Collins & Koo, 2005) ⇒ Michael Collins, and Terry Koo. (2005). “Discriminative Reranking for Natural Language Parsing.” In: Computational Linguistics, 31(1). doi:10.1162/0891201053630273
2003
- (Sha & Pereira, 2003a) ⇒ Fei Sha, and Fernando Pereira. (2003). “Shallow Parsing with Conditional Random Fields.” In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL 2003). doi:10.3115/1073445.1073473
Quotes
Abstract
We describe new algorithms for training tagging models, as an alternative to maximum-entropy models or conditional random fields (CRFs). The algorithms rely on Viterbi decoding of training examples, combined with simple additive updates. We describe theory justifying the algorithms through a modification of the proof of convergence of the perceptron algorithm for classification problems. We give experimental results on part-of-speech tagging and base noun phrase chunking, in both cases showing improvements over results for a maximum-entropy tagger.
…
2 Parameter Estimation
2.1 HMM Taggers
... As an alternative to maximum-likelihood parameter estimates, this paper will propose the following estimation algorithm. Say the training set consists of [math]\displaystyle{ n }[/math] tagged sentences, the [math]\displaystyle{ i }[/math]th sentence being of length [math]\displaystyle{ n_i }[/math].
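A minimal, self-contained sketch of this estimation scheme, per the abstract: Viterbi-decode each training sentence under the current weights, then apply a simple additive (perceptron-style) update when the decoded tag sequence differs from the gold one. The toy tag set, feature templates, and function names below are illustrative assumptions, not the paper's notation:

```python
from collections import defaultdict

TAGS = ["DT", "NN"]  # toy tag set for illustration

def features(words, tags):
    """Global feature vector for a tagged sentence: counts of
    emission and transition indicator features."""
    f = defaultdict(int)
    prev = "<s>"
    for w_i, t in zip(words, tags):
        f[("emit", t, w_i)] += 1
        f[("trans", prev, t)] += 1
        prev = t
    return f

def viterbi(words, weights):
    """Highest-scoring tag sequence under the current weights."""
    # delta[t] = best score of any tag prefix ending in tag t
    delta = {t: weights[("emit", t, words[0])] + weights[("trans", "<s>", t)]
             for t in TAGS}
    back = []
    for w_i in words[1:]:
        new, ptr = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: delta[p] + weights[("trans", p, t)])
            new[t] = (delta[best_prev] + weights[("trans", best_prev, t)]
                      + weights[("emit", t, w_i)])
            ptr[t] = best_prev
        delta = new
        back.append(ptr)
    # follow back-pointers from the best final tag
    t = max(TAGS, key=lambda x: delta[x])
    tags = [t]
    for ptr in reversed(back):
        t = ptr[t]
        tags.append(t)
    return list(reversed(tags))

def train(examples, epochs=5):
    """examples: list of (words, gold_tags) pairs."""
    w = defaultdict(float)
    for _ in range(epochs):
        for words, gold in examples:
            pred = viterbi(words, w)
            if pred != gold:  # additive update on mistakes only
                for k, v in features(words, gold).items():
                    w[k] += v
                for k, v in features(words, pred).items():
                    w[k] -= v
    return w
```

Note the contrast with maximum-likelihood estimation: no probabilities are normalized anywhere; the weights are adjusted only when the decoder makes a mistake on a training sentence.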
... Maximum-entropy models represent the tagging task through a feature-vector representation of history–tag pairs. A feature-vector representation [math]\displaystyle{ \Phi : \mathcal{H} \times \mathcal{T} \rightarrow \mathbb{R}^d }[/math] is a function that maps a history–tag pair to a [math]\displaystyle{ d }[/math]-dimensional feature vector. Each component [math]\displaystyle{ \Phi_s(h, t) }[/math] for [math]\displaystyle{ s = 1 \ldots d }[/math] could be an arbitrary function of [math]\displaystyle{ (h, t) }[/math]. It is common (e.g., see (Ratnaparkhi 96)) for each feature [math]\displaystyle{ \Phi_s }[/math] to be an indicator function. For example, one such feature might be

[math]\displaystyle{ \Phi_{1000}(h, t) = \begin{cases} 1 & \text{if current word } w_i \text{ is “the” and } t = \text{DT} \\ 0 & \text{otherwise} \end{cases} }[/math]
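The indicator feature just described can be sketched as a plain function; the index 1000 and the dict-based history are assumptions for illustration, not the paper's data structures:

```python
# Hypothetical indicator feature in the style of (Ratnaparkhi 96):
# fires exactly when the current word is "the" and the proposed tag is DT.

def phi_1000(history, tag):
    """history is assumed to carry the current word under key 'w_i'."""
    return 1 if history["w_i"] == "the" and tag == "DT" else 0
```

In practice a tagger defines many thousands of such features from templates (word identity, suffixes, surrounding tags), and the model score is a weighted sum over the features that fire.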
2.2 Local and Global Feature Vectors
We now describe how to generalize the algorithm to more general representations of tagged sequences. In this section we describe the feature-vector representations which are commonly used in maximum-entropy models for tagging, and which are also used in this paper.
…
Author | volume | Date Value | title | type | journal | titleUrl | doi | note | year
---|---|---|---|---|---|---|---|---|---
Michael Collins | | 2002 | Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm | | Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing | http://www.ai.mit.edu/people/mcollins/papers/tagperc.pdf | 10.3115/1118693.1118694 | | 2002