2003 Efficiently Inducing Features of Conditional Random Fields


Subject Headings: Sequence Tagging Feature.


Conditional Random Fields (CRFs) are undirected graphical models, a special case of which corresponds to conditionally-trained finite state machines. A key advantage of CRFs is their great flexibility to include a wide variety of arbitrary, non-independent features of the input. Faced with this freedom, however, an important question remains: what features should be used? This paper presents an efficient feature induction method for CRFs. The method is founded on the principle of iteratively constructing feature conjunctions that would significantly increase conditional log-likelihood if added to the model. Automated feature induction enables not only improved accuracy and a dramatic reduction in parameter count, but also the use of larger cliques, and more freedom to liberally hypothesize atomic input variables that may be relevant to a task. The method applies to linear-chain CRFs, as well as to more arbitrary CRF structures, such as Relational Markov Networks, where it corresponds to learning clique templates, and can also be understood as supervised structure learning. Experimental results on named entity extraction and noun phrase segmentation tasks are presented.

1 Introduction

Many tasks are best performed by models that have the flexibility to use arbitrary, overlapping, multi-granularity and non-independent features. For example, in natural language tasks, the need for labeled data can be drastically reduced by using features that take advantage of domain knowledge in the form of word lists, part-of-speech tags, character n-grams, capitalization patterns, page layout and font information. It is difficult to capture such inter-dependent features with a generative probabilistic model because the dependencies among generated variables should be explicitly captured in order to reproduce the data. However, conditional probability models, such as conditional maximum entropy classifiers, need not capture dependencies among variables on which they condition, but do not generate. There has been significant work, for instance, with such models for greedy sequence modeling in NLP, e.g. (Ratnaparkhi, 1996; Borthwick et al., 1998). Conditional Random Fields (CRFs) (Lafferty et al., 2001) are undirected graphical models, trained to maximize the conditional probability of the outputs given the inputs. When the edges among the output variables form a linear chain, they correspond to conditionally-trained finite state machines. While based on the same exponential form as maximum entropy models, they have efficient procedures for complete, non-greedy finite-state inference and training. CRFs have achieved empirical success recently in POS tagging (Lafferty et al., 2001), noun phrase segmentation (Sha & Pereira, 2003) and table extraction from government reports (Pinto et al., 2003).
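For reference, the linear-chain case described above can be made concrete; as standardly defined in Lafferty et al. (2001), the model assigns a conditional probability to a label sequence y given an input sequence x of the form:

```latex
p_\Lambda(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Here the f_k are the (possibly conjoined) feature functions, the λ_k their learned weights, and Z(x) normalizes over all label sequences, which is what permits complete, non-greedy inference and training.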

Given these models’ great flexibility to include a wide array of features, an important question remains: what features should be used? Some features are atomic and provided, but since CRFs are log-linear models, one will also want to gain expressive power by using some conjunctions. Standard previous approaches build large sets of feature conjunctions according to hand-built, general patterns. This can result in extremely large feature sets, with millions of features, e.g. (Sha & Pereira, 2003).
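As a hypothetical illustration of the fixed-pattern approach described above (the predicates and pattern are invented for this sketch, not taken from the paper), conjunctions of atomic features can be enumerated mechanically, which is why pattern-based feature sets explode in size:

```python
from itertools import combinations

def atomic_features(word):
    """Hypothetical atomic predicates for one token (illustrative only)."""
    feats = {"word=" + word.lower()}
    if word.istitle():
        feats.add("is-capitalized")
    if word.isdigit():
        feats.add("is-digit")
    return feats

def conjunctions(feats, max_size=2):
    """Every conjunction of the atomic features up to max_size --
    the kind of exhaustive pattern expansion the text describes."""
    out = set()
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(feats), size):
            out.add(" & ".join(combo))
    return out
```

With tens of atomic predicates per token and patterns spanning several positions, this expansion quickly reaches millions of distinct conjunctions, which is the scaling problem that feature induction addresses.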

However, even with this many parameters, the feature set is still restricted. For example, in some cases capturing a word tri-gram is important, but there is not sufficient memory or computation to include all word tri-grams. As the number of overlapping atomic features increases, the difficulty and importance of constructing only select feature combinations grows.

This paper presents a feature induction method for arbitrarily-structured and linear-chain CRFs. Founded on the principle of constructing only those feature conjunctions that significantly increase log-likelihood, the approach builds on that of Della Pietra et al. (1997), but is altered to work with conditional rather than joint probabilities, and with a mean-field approximation and other modifications to improve efficiency specifically for a conditional model. In comparison with traditional approaches, automated feature induction offers both improved accuracy and a significant reduction in feature count; it enables the use of richer, higher-order Markov models; and offers more freedom to liberally guess about which atomic features may be relevant to a task.
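A minimal sketch of the induction principle, under a simplified per-position (mean-field-like) view in which the current model supplies a conditional distribution over labels at each position; the crude grid search over the candidate's single weight mu stands in for the paper's more careful gain optimization, and all names here are invented for illustration:

```python
import math

def gain(candidate, data, probs):
    """Approximate conditional log-likelihood gain from adding feature
    `candidate` with one new weight mu, holding all old weights fixed.
    `data` is a list of (x, y) pairs; `probs[i]` maps each label to the
    current model's conditional probability at example i."""
    best = 0.0
    for mu in (x / 10.0 for x in range(-30, 31)):  # crude 1-D grid search
        total = 0.0
        for (x, y), p in zip(data, probs):
            # new normalizer, expressed relative to the old model's distribution
            z = sum(p[lab] * math.exp(mu * candidate(x, lab)) for lab in p)
            total += mu * candidate(x, y) - math.log(z)
        best = max(best, total)
    return best

def induce(candidates, data, probs, threshold=0.1):
    """Keep only the candidate conjunctions whose estimated gain
    exceeds a threshold -- the core of the induction loop."""
    return [c for c in candidates if gain(c, data, probs) > threshold]
```

In the paper, such gains are evaluated for many candidate conjunctions per iteration, the best candidates are added to the model, and all weights are retrained before the next round.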

We present results on two natural language tasks. The CoNLL-2003 named entity recognition shared task consists of Reuters news articles with tagged entities PERSON, LOCATION, ORGANIZATION and MISC. The data is quite complex, including foreign person names (such as Yayuk Basuki and Innocent Butare), a wide diversity of locations (including sports venues such as The Oval, and rare location names such as Nirmal Hriday), many types of organizations (from company names such as 3M, to acronyms for political parties such as KDP, to location names used to refer to sports teams such as Cleveland), and a wide variety of miscellaneous named entities (from software such as Java, to nationalities such as Basque, to sporting competitions such as 1,000 Lakes Rally).

On this task feature induction reduces error by 40% (increasing F1 from 73% to 89%) in comparison with fixed, hand-constructed conjunction patterns. There is evidence that the fixed-pattern model is severely overfitting, and that feature induction reduces overfitting while still allowing use of large, rich knowledge-laden feature sets. On a standard noun phrase segmentation task we match world-class accuracy while using an order of magnitude fewer features.

6 Conclusions

Conditional random fields provide tremendous flexibility to include a great diversity of features. The paper has presented an efficient method of automatically inducing features that most improve conditional log-likelihood. The experimental results are quite positive.

We have focused here on inducing new conjunctions (or cliques) of the input variables, however the method also naturally applies to inducing new cliques of the output variables, or input and output variables combined. This corresponds to structure learning and “clique template” learning for conditional Markov networks, such as Relational Markov Networks (Taskar et al., 2002), and experimental exploration in this area is a topic of future work.



Andrew McCallum. (2003). “Efficiently Inducing Features of Conditional Random Fields.”