2003 ShallowParsingwithConditionalRa

Quotes

Abstract

Conditional random fields for sequence labeling offer advantages over both generative models like HMMs and classifiers applied at each sequence position. Among sequence labeling tasks in language processing, shallow parsing has received much attention, with the development of standard evaluation datasets and extensive comparison among methods. We show here how to train a conditional random field to achieve performance as good as any reported base noun-phrase chunking method on the CoNLL task, and better than any reported single model. Improved training methods based on modern optimization algorithms were critical in achieving these results. We present extensive comparisons between models and training methods that confirm and strengthen previous results on shallow parsing and training methods for maximum-entropy models.

1. Introduction

Sequence analysis tasks in language and biology are often described as mappings from input sequences to sequences of labels encoding the analysis. In language processing, examples of such tasks include part-of-speech tagging, named-entity recognition, and the task we shall focus on here, shallow parsing. Shallow parsing identifies the non-recursive cores of various phrase types in text, possibly as a precursor to full parsing or information extraction (Abney, 1991). The paradigmatic shallow-parsing problem is NP chunking, which finds the nonrecursive cores of noun phrases called base NPs. The pioneering work of Ramshaw and Marcus (1995) introduced NP chunking as a machine-learning problem, with standard datasets and evaluation metrics. The task was extended to additional phrase types for the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000), which is now the standard evaluation task for shallow parsing.

Most previous work used two main machine-learning approaches to sequence labeling. The first approach relies on k-order generative probabilistic models of paired input sequences and label sequences, for instance hidden Markov models (HMMs) (Freitag and McCallum, 2000; Kupiec, 1992) or multilevel Markov models (Bikel et al., 1999). The second approach views the sequence labeling problem as a sequence of classification problems, one for each of the labels in the sequence. The classification result at each position may depend on the whole input and on the previous k classifications. (1. Ramshaw and Marcus (1995) used transformation-based learning (Brill, 1995), which for the present purposes can be tought of as a classification-based method.)

The generative approach provides well-understood training and decoding algorithms for HMMs and more general graphical models. However, effective generative models require stringent conditional independence assumptions. For instance, it is not practical to make the label at a given position depend on a window on the input sequence as well as the surrounding labels, since the inference problem for the corresponding graphical model would be intractable. Non-independent features of the inputs, such as capitalization, suffixes, and surrounding words, are important in dealing with words unseen in training, but they are difficult to represent in generative models.

The sequential classification approach can handle many correlated features, as demonstrated in work on maximum-entropy (McCallum et al., 2000; Ratnaparkhi, 1996) and a variety of other linear classifiers, including winnow (Punyakanok and Roth, 2001), AdaBoost (Abney et al., 1999), and support-vector machines (Kudo and Matsumoto, 2001). Furthermore, they are trained to minimize some function related to labeling error, leading to smaller error in practice if enough training data are available. In contrast, generative models are trained to maximize the joint probability of the training data, which is not as closely tied to the accuracy metrics of interest if the actual data was not generated by the model, as is always the case in practice.

However, since sequential classifiers are trained to make the best local decision, unlike generative models they cannot trade off decisions at different positions against each other. In other words, sequential classifiers are myopic about the impact of their current decision on later decisions (Bottou, 1991; Lafferty et al., 2001). This forced the best sequential classifier systems to resort to heuristic combinations of forward-moving and backward-moving sequential classifiers (Kudo and Matsumoto, 2001).

Conditional random fields (CRFs) bring together the best of generative and classification models. Like classification models, they can accommodate many statistically correlated features of the inputs, and they are trained discriminatively. But like generative models, they can trade off decisions at different sequence positions to obtain a globally optimal labeling. Lafferty et al. (2001) showed that CRFs beat related classification models as well as HMMs on synthetic data and on a part-of-speech tagging task.

In the present work, we show that CRFs beat all reported single-model NP chunking results on the standard evaluation dataset, and are statistically indistinguishable from the previous best performer, a voting arrangement of 24 forward- and backward-looking support-vector classifiers (Kudo and Matsumoto, 2001). To obtain these results, we had to abandon the original iterative scaling CRF training algorithm for convex optimization algorithms with better convergence properties. We provide detailed comparisons between training methods.

The generalized perceptron proposed by Collins (2002) is closely related to CRFs, but the best CRF training methods seem to have a slight edge over the generalized perceptron.

2 Conditional Random Fields

We focus here on conditional random fields on sequences, although the notion can be used more generally (Lafferty et al., 2001; Taskar et al., 2002). Such CRFs define conditional probability distributions p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length, and use x = x_1 … x_n and y = y_1 … y_n for the generic input sequence and label sequence, respectively.

A CRF on (X; Y ) is specified by a vector f of local features and a corresponding weight vector $\lambda$. Each local feature is either a state feature s(y; x; i) or a transition feature t(y; y0; x; i), where y; y0 are labels, x an input sequence, and i an input position. … …

3 Training Methods

Lafferty et al. (2001) used iterative scaling algorithms for CRF training, following earlier work on maximumentropy models for natural language (Berger et al., 1996; Della Pietra et al., 1997). …

4 Shallow Parsing

Figure 1 shows the base NPs in an example sentence. Following Ramshaw and Marcus (1995), the input to the NP chunker consists of the words in a sentence annotated automatically with part-of-speech (POS) tags. The chunker's task is to label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I). For example, the tokens in first line of Figure 1 would be labeled BIIBIIOBOBIIO.

[Rockwell International Corp.] ['s Tulsa unit] said it signed a tentative agreement extending [its contract] with [Boeing Co.] to provide structural parts for [Boeing] 's [747 jetliners] .

Figure 1: NP chunks

4.1 Data Preparation

NP chunking results have been reported on two slightly different data sets: the original RM data set of Ramshaw and Marcus (1995), and the modified CoNLL-2000 version of Tjong Kim Sang and Buchholz (2000). Although the chunk tags in theRM and CoNLL-2000 are somewhat different, we found no significant accuracy differences between models trained on these two data sets. Therefore, all our results are reported on the CoNLL-2000 data set. We also used a development test set, provided by Michael Collins, derived from WSJ section 21 tagged with the Brill (1995) POS tagger.

4.2 CRFs for Shallow Parsing

Our chunking CRFs have a second-order Markov dependency between chunk tags. This is easily encoded by making the CRF labels pairs of consecutive chunk tags. That is, the label at position i is yi = ci..1ci, where ci is the chunk tag of word i, one of O, B, or I. Since B must be used to start a chunk, the label OI is impossible. In addition, successive labels are constrained: yi..1 = ci..2ci..1, yi = ci..1ci, and c0 = O. These constraints on the model topology are enforced by giving appropriate features a weight of..1, forcing all the forbidden labelings to have zero probability.

Our choice of features was mainly governed by computing power, since we do not use feature selection and all features are used in training and testing. We use the following factored representation for features

$f(y_{i-1}; y_i; \bsx; i) = p(\bsx; i)q(y_{i..1}; y_i) (4)$

where p(x; i) is a predicate on the input sequence x and current position i and q(yi..1; yi) is a predicate on pairs of labels. For instance, p(x; i) might be .word at position i is the. or .the POS tags at positions i .. 1; i are DT, NN.. Because the label set is finite, such a factoring of f(yi..1; yi; x; i) is always possible, and it allows each input predicate to be evaluated just once for many features that use it, making it possible to work with millions of features on large training sets.

q(yi..1; yi)                 p(x; i)
-----
yi = y                       true
yi = y; yi..1 = y0
c(yi) = c
-----
yi = y wi = w
or
c(yi) = c wi+1 = w
----
wi..1 = w
wi..2 = w
wi+2 = w
wi..1 = w0; wi = w
wi+1 = w0; wi = w
ti = t
ti..1 = t
ti+1 = t
ti..2 = t
ti+2 = t
ti..1 = t0; ti = t
ti..2 = t0; ti..1 = t
ti = t0; ti+1 = t
ti+1 = t0; ti+2 = t
ti..2 = t00; ti..1 = t0; ti = t
ti..1 = t00; ti = t0; ti+1 = t
ti = t00; ti+1 = t0; ti+2 = t
Table 1: Shallow parsing features


Table 1 summarizes the feature set. For a given position i, wi is the word, ti its POS tag, and yi its label. For any label y = c0c, c(y) = c is the corresponding chunk tag. For example, c(OB) = B. The use of chunk tags as well as labels provides a form of backoff from the very small feature counts that may arise in a secondorder model, while allowing significant associations between tag pairs and input predicates to be modeled. To save time in some of our experiments, we used only the 820,000 features that are supported in the CoNLL training set, that is, the features that are on at least once. For our highest F score, we used the complete feature set, around 3.8 million in the CoNLL training set, which contains all the features whose predicate is on at least once in the training set. The complete feature set may in principle perform better because it can place negative weights on transitions that should be discouraged if a given predicate is on.

4.3 Parameter Tuning

As discussed previously, we need a Gaussian weight prior to reduce overfitting. We also need to choose the number of training iterations since we found that the best F score is attained while the log-likelihood is still improving. The reasons for this are not clear, but the Gaussian prior may not be enough to keep the optimization from making weight adjustments that slighly improve training log-likelihood but cause large F score �uctuations. We used the development test set mentioned in Section 4.1 to set the prior and the number of iterations.

4.4 Evaluation Metric

The standard evaluation metrics for a chunker are precision P (fraction of output chunks that exactly match the reference chunks), recall R (fraction of reference chunks returned by the chunker), and their harmonic mean, the F1 score F1 = 2 � P � R=(P + R) (which we call just F score in what follows). The relationships between F score and labeling error or log-likelihood are not direct, so we report both F score and the other metrics for the models we tested. For comparisons with other reported results we use F score.

4.5 Significance Tests

Ideally, comparisons among chunkers would control for feature sets, data preparation, training and test procedures, and parameter tuning, and estimate the statistical significance of performance differences. Unfortunately, reported results sometimes leave out details needed for accurate comparisons. We report F scores for comparison with previous work, but we also give statistical signifi- cance estimates using McNemar's test for those methods that we evaluated directly.

Testing the significance of F scores is tricky because the wrong chunks generated by two chunkers are not directly comparable. Yeh (2000) examined randomized tests for estimating the significance of F scores, and in particular the bootstrap over the test set (Efron and Tibshirani, 1993; Sang, 2002). However, bootstrap variances in preliminary experiments were too high to allow any conclusions, so we used instead a McNemar paired test on labeling disagreements (Gillick and Cox, 1989).

5 Results

All the experiments were performed with our Java implementation of CRFs,designed to handle millions of features, on 1.7 GHz Pentium IV processors with Linux and IBM Java 1.3.0. Minor variants support voted perceptron (Collins, 2002) and MEMMs (McCallum et al., 2000) with the same efficient feature encoding. GIS, CG, and L-BFGS were used to train CRFs and MEMMs.

6 Conclusions

We have shown that (log-)linear sequence labeling models trained discriminatively with general-purpose optimization methods are a simple, competitive solution to learning shallow parsers. These models combine the best features of generative finite-state models and discriminative (log-)linear classifiers, and do NP chunking as well as or better than "ad hoc" classifier combinations, which were the most accurate approach until now. In a longer version of this work we will also describe shallow parsing results for other phrase types. There is no reason why the same techniques cannot be used equally successfully for the other types or for other related tasks, such as POS tagging or named-entity recognition.

On the machine-learning side, it would be interesting to generalize the ideas of large-margin classification to sequence models, strengthening the results of Collins (2002) and leading to new optimal training algorithms with stronger guarantees against overfitting.

On the application side, (log-)linear parsing models have the potential to supplant the currently dominant lexicalized PCFG models for parsing by allowing much richer feature sets and simpler smoothing, while avoiding the label bias problem that may have hindered earlier classifier-based parsers (Ratnaparkhi, 1997). However, work in that direction has so far addressed only parse reranking (Collins and Duffy, 2002; Riezler et al., 2002). Full discriminative parser training faces significant algorithmic challenges in the relationship between parsing alternatives and feature values (Geman and Johnson, 2002) and in computing feature expectations.

References

,

volumeDate ValuetitletypejournaltitleUrldoinoteyear
2003 ShallowParsingwithConditionalRaShallow Parsing with Conditional Random Fields10.3115/1073445.10734732003
 Author Fei Sha + and Fernando Pereira + doi 10.3115/1073445.1073473 + title Shallow Parsing with Conditional Random Fields + year 2003 +