2005 StackedSequentialLearning

From GM-RKB

Subject Headings: Stacked Supervised Sequential Tagging Algorithm.

Notes

Cited By

2011

Quotes

Abstract

We describe a new sequential learning scheme called “stacked sequential learning”. Stacked sequential learning is a meta-learning algorithm, in which an arbitrary base learner is augmented so as to make it aware of the labels of nearby examples. We evaluate the method on several “sequential partitioning problems”, which are characterized by long runs of identical labels. We demonstrate that on these problems, sequential stacking consistently improves the performance of nonsequential base learners; that sequential stacking often improves performance of learners (such as CRFs) that are designed specifically for sequential tasks; and that a sequentially stacked maximum-entropy learner generally outperforms CRFs.

1 Introduction

In this paper, we will consider the application of sequential probabilistic learners to sequential partitioning tasks. Sequential partitioning tasks are sequential classification tasks characterized by long runs of identical labels: examples of these tasks include document analysis, video segmentation, and gene finding.

Motivated by some anomalous behavior observed for one sequential learning method on a particular partitioning task, we will derive a new learning scheme called stacked sequential learning. Like boosting, stacked sequential learning is a meta-learning method, in which an arbitrary base learner is augmented — in this case, by making the learner aware of the labels of nearby examples. Sequential stacking is simple to implement, can be applied to virtually any base learner, and imposes only a constant overhead in training time: in our implementation, the sequentially stacked version of the base learner A trains about seven times more slowly than A.
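
As a concrete (and deliberately simplified) illustration of the scheme described above, the Python sketch below augments a scikit-learn LogisticRegression base learner with the predicted labels of nearby lines. The choice of scikit-learn, the 5-fold cross-validated stage-1 predictions, and the window of two neighbors on each side are assumptions made for illustration; they are not necessarily the paper's exact construction.

    # A minimal sketch of a stacked sequential learner (illustrative; details assumed).
    import numpy as np
    from sklearn.linear_model import LogisticRegression      # assumed maxent stand-in
    from sklearn.model_selection import cross_val_predict

    WINDOW = 2   # assumed: use predicted labels of 2 neighbors on each side

    def extend_with_neighbor_labels(X_seq, yhat_seq, window=WINDOW):
        """Append the (predicted) labels of nearby lines to each line's feature vector."""
        n = len(X_seq)
        rows = []
        for i in range(n):
            neighbors = [yhat_seq[i + d] if 0 <= i + d < n else 0.0
                         for d in range(-window, window + 1)]
            rows.append(np.concatenate([X_seq[i], neighbors]))
        return np.vstack(rows)

    def train_stacked(sequences, labels):
        """sequences: list of (n_i, d) arrays of line features; labels: list of 0/1 arrays."""
        X, y = np.vstack(sequences), np.concatenate(labels)
        base = LogisticRegression(max_iter=1000).fit(X, y)             # stage-1 learner
        # Cross-validated predictions, so stage 2 sees realistic (imperfect) labels.
        yhat = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
        ext, offset = [], 0
        for X_seq in sequences:
            ext.append(extend_with_neighbor_labels(X_seq, yhat[offset:offset + len(X_seq)]))
            offset += len(X_seq)
        stacked = LogisticRegression(max_iter=1000).fit(np.vstack(ext), y)  # stage-2 learner
        return base, stacked

    def predict_stacked(base, stacked, X_seq):
        """Stage 1 predicts provisional labels; stage 2 uses them as extra features."""
        return stacked.predict(extend_with_neighbor_labels(X_seq, base.predict(X_seq)))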

In experiments on several partitioning tasks, sequential stacking consistently improves the performance of nonsequential base learners. More surprisingly, sequential stacking also often improves performance of learners specifically designed for sequential tasks, such as conditional random fields and discriminatively trained HMMs. Finally, on our set of benchmark problems, a sequentially stacked maximum-entropy learner generally outperforms conditional random fields.

2 Motivation: A Hard Task for MEMMs

To motivate the novel learning method that we will describe below, we will first analyze the behavior of one well-known sequential learner on a particular real-world problem. In a recent paper [Carvalho and Cohen, 2004], we evaluated a number of sequential learning methods on the problem of recognizing the “signature” section of an email message. Each line of an email message was represented with a set of handcrafted features, such as “line contains a possible phone number”, “line is blank”, etc. Each email message was represented as a vector x of feature-vectors x_1, …, x_n, where x_i is the feature-vector representation of the i-th line of the message. A line was labeled as positive if it was part of a signature section, and negative otherwise. The labels for a message were represented as another vector y, where y_i is the label for line i.
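
The per-line representation can be pictured as a small vector of boolean indicators. The sketch below is purely illustrative: the feature names and regular expressions are invented stand-ins, not the actual “basic” feature set of Carvalho & Cohen.

    import re

    # Illustrative hand-crafted line features (invented for this sketch; not the actual
    # feature set of Carvalho & Cohen, 2004). Each email line becomes a 0/1 indicator vector.
    PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")
    URL_RE = re.compile(r"https?://|www\.")
    EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

    def line_features(line):
        return [
            int(line.strip() == ""),                  # line is blank
            int(PHONE_RE.search(line) is not None),   # line contains a possible phone number
            int(URL_RE.search(line) is not None),     # line contains a URL
            int(EMAIL_RE.search(line) is not None),   # line contains an email address
            int(line.lstrip().startswith("--")),      # common "--" signature delimiter
            int(len(line) < 20),                      # short line
        ]

    # A message becomes x = [line_features(l) for l in message.splitlines()], paired with
    # a label vector y in which y_i = 1 for signature lines and 0 otherwise.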

The dataset contains 33,013 labeled lines from 617 email messages. About 10% of the lines are labeled “positive”. Signature sections always fall at the end of a message, usually in the last 10 lines. In the experiments below, the data was split into a training set (of 438 sequences/emails) and a test set with the remaining sequences, and we used the “basic” feature set from Carvalho & Cohen.

The complete dataset is represented as a set S of examples S = {(x_1, y_1), …, (x_t, y_t), …, (x_m, y_m)}. Sequential learning is the problem of learning, from such a dataset, a sequential classifier — i.e., a function f such that f(x) produces a vector of class labels y. Clearly, any ordinary non-sequential learning algorithm can be used for sequential learning, by ignoring the sequential nature of the data. [1]
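
Footnote [1] spells out this reduction; a minimal sketch (again assuming a scikit-learn logistic-regression learner as a stand-in for the non-sequential base learner) looks like this:

    import numpy as np
    from sklearn.linear_model import LogisticRegression   # assumed stand-in for the base learner

    def train_nonsequential(sequences, labels):
        """Ignore the sequence structure: pool every (x_{t,i}, y_{t,i}) pair into one dataset."""
        X = np.vstack(sequences)        # all lines from all messages
        y = np.concatenate(labels)      # their per-line labels
        return LogisticRegression(max_iter=1000).fit(X, y)

    def predict_sequence(g, X_seq):
        """Classify each line of one message independently of its position in the sequence."""
        return g.predict(X_seq)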

In the previous paper [Carvalho and Cohen, 2004], we reported results for several non-sequential and sequential learners on the signature-detection problem, including a nonsequential maximum entropy learner [Berger et al., 1996] (henceforth ME) and conditional random fields [Lafferty et al., 2001] (henceforth CRFs).

Method    Noise    Error    Min Error
ME                 3.47     3.20
MEMM               31.83    4.26
CRF                1.17     1.17
MEMM      10%      2.18     2.18
CRF       10%      1.85     1.84

Table 1: Performance of several sequential learners on the signature-detection problem.

Another plausible sequential learning method to apply to this task is maximum-entropy Markov models (MEMMs) [McCallum et al., 2000], also called maximum-entropy taggers [Ratnaparkhi, 1999], conditional Markov models [Klein and Manning, 2002], and recurrent sliding windows [Dietterich, 2002]. In this model, the conditional probability of a label sequence y given an instance sequence x is defined to be Pr(y|x) = ∏_i Pr(y_i | y_{i-1}, x_i). The local model Pr(y_i | y_{i-1}, x_i) is learned as follows. First one constructs an extended dataset, which is a collection of non-sequential examples of the form ((x_i, y_{i-1}), y_i), where (x_i, y_{i-1}) denotes an instance in which the original feature vector for x_i is augmented by adding a feature for y_{i-1}. We will call (x_i, y_{i-1}) an extended instance, and call y_{i-1} a history feature. Note that y_i is the class label for the extended example ((x_i, y_{i-1}), y_i). After constructing extended instances, one trains a maximum-entropy conditional model from the extended dataset. Inference is done by using a Viterbi search to find the best label sequence y.
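
The construction of the extended dataset and the training of the local model can be sketched as follows. The code assumes a scikit-learn LogisticRegression as the maximum-entropy local model and encodes the missing history at position 0 with a placeholder value; both choices are illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression   # assumed maxent stand-in

    START = 0.0   # assumed placeholder for the missing history label at position 0

    def build_extended_dataset(sequences, labels):
        """Create non-sequential examples ((x_i, y_{i-1}), y_i) for the MEMM local model."""
        X_ext, y_ext = [], []
        for X_seq, y_seq in zip(sequences, labels):
            for i in range(len(X_seq)):
                history = y_seq[i - 1] if i > 0 else START   # the history feature y_{i-1}
                X_ext.append(np.append(X_seq[i], history))   # the extended instance (x_i, y_{i-1})
                y_ext.append(y_seq[i])                       # its class label y_i
        return np.vstack(X_ext), np.array(y_ext)

    def train_local_model(sequences, labels):
        X_ext, y_ext = build_extended_dataset(sequences, labels)
        return LogisticRegression(max_iter=1000).fit(X_ext, y_ext)   # models Pr(y_i | y_{i-1}, x_i)

    # At test time the true y_{i-1} is unknown, so decoding must search over label
    # sequences (Viterbi, or the beam search sketched after the next paragraph).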

MEMMs have a number of nice properties. Relative to the more recently-proposed CRF model, MEMMs are easy to implement, and (since no inference is done at learning time) relatively quick to train. MEMMs can also be easily generalized by replacing the local model with one that uses a longer “history” of k previous labels — i.e., a model of the form Pr(y_i | y_{i-1}, …, y_{i-k}, x_i) — and replacing the Viterbi search with a beam search. Such a learner scales well with the history size and number of possible classes y.
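
A beam-search decoder over such a k-history local model might look like the following sketch; the feature layout (history labels appended after x_i) and the default beam width are assumptions carried over from the previous sketch.

    import numpy as np

    def beam_search(local, X_seq, k=2, beam_width=5, start=0.0):
        """Approximate the best label sequence under prod_i Pr(y_i | y_{i-1}, ..., y_{i-k}, x_i).
        `local` is assumed to be a probabilistic classifier (like the logistic model above)
        trained on extended instances [x_i, y_{i-k}, ..., y_{i-1}]."""
        beam = [((), 0.0)]   # (partial label sequence, accumulated log-probability)
        for x in X_seq:
            candidates = []
            for hist, logp in beam:
                padded = ((start,) * k + hist)[-k:]        # last k labels, start-padded
                ext = np.append(x, padded)                 # extended instance with k history features
                probs = local.predict_proba([ext])[0]      # Pr(y_i | history, x_i)
                for lab, p in zip(local.classes_, probs):
                    candidates.append((hist + (lab,), logp + np.log(p + 1e-12)))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beam = candidates[:beam_width]                 # keep only the best partial sequences
        return list(beam[0][0])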

Unfortunately, as Table 1 shows, MEMMs perform extremely badly on the signature-detection problem, with an error rate many times the error rate of CRFs. [2] In fact, on this problem, MEMMs perform much worse than the nonsequential maximum-entropy learner ME.

The MEMM’s performance is better if one changes the threshold used to classify examples. Letting p̂_i be the probability Pr(y_i = + | x_i, y_{i-1}) as computed by the MEMM, we found, for each learner, the threshold θ such that the rule “(y_i = +) ⇔ (p̂_i > θ)” gives the lowest test error rate. The column labeled “Min Error” in Table 1 gives this “best possible” result. The “Min Error” for MEMMs is much improved, but still higher than that of non-sequential ME.
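
Computing this “best possible” threshold amounts to a simple sweep over candidate values of θ; a small sketch:

    import numpy as np

    def min_error_threshold(p_hat, y_true):
        """Find the threshold theta minimizing error for the rule: predict + iff p_hat_i > theta.
        p_hat: predicted probabilities Pr(y_i = +); y_true: 0/1 gold labels."""
        p_hat, y_true = np.asarray(p_hat), np.asarray(y_true)
        best_theta, best_err = 0.5, np.mean((p_hat > 0.5) != y_true)
        # Only the observed probabilities (plus the extremes) need to be tried as cut points.
        for theta in np.concatenate(([0.0, 1.0], p_hat)):
            err = np.mean((p_hat > theta) != y_true)
            if err < best_err:
                best_theta, best_err = theta, err
        return best_theta, best_err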

The high error occurs because on many test email messages, the learned MEMM makes a false positive classification somewhere before the signature starts, and then “gets stuck” and marks every subsequent line as part of a signature. This behavior is not consistent with previously-described limitations of MEMMs. It is known that MEMMs can represent only a proper subset of the distributions that can be represented by CRFs [Lafferty et al., 2001]; however, this “label bias problem” does not explain why MEMMs perform worse than non-sequential ME, since MEMMs clearly can represent strictly more distributions than ME.

Klein and Manning (2002) also describe an “observation bias problem”, in which MEMMs give too little weight to the history features. However, in this case, relative to the weights assigned by a CRF, the MEMM seems to give too much weight to the history features. To verify this, we encouraged the MEMM to downweight the history features by adding noise to the training (not test) data: for each training email/sequence x, we consider each feature-vector x_i ∈ x in turn, and with probability 0.10, we swap line x_i (together with its label y_i) with some other (x_j, y_j) chosen uniformly from the sequence. Adding this “sequence noise” almost doubles the error rate for CRFs, but reduces the error rate for MEMMs. (Of course, this type of noise does not affect nonsequential ME.) This experiment supports the hypothesis that the MEMM is overweighting history features.
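
The sequence-noise procedure is easy to reproduce in a few lines; the sketch below swaps each line, with probability 0.10, with a uniformly chosen position in the same message (a draw may occasionally pick the line itself, which leaves it in place).

    import random

    def add_sequence_noise(X_seq, y_seq, p=0.10, rng=random):
        """With probability p, swap line x_i (and its label y_i) with a position chosen
        uniformly at random from the same message. Applied to training sequences only."""
        X_seq, y_seq = list(X_seq), list(y_seq)
        n = len(X_seq)
        for i in range(n):
            if rng.random() < p:
                j = rng.randrange(n)   # uniform position; j == i leaves the line unchanged
                X_seq[i], X_seq[j] = X_seq[j], X_seq[i]
                y_seq[i], y_seq[j] = y_seq[j], y_seq[i]
        return X_seq, y_seq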

3 Stacked Sequential Learning

3.1 Description

Footnotes

  1. Specifically, one could build a dataset of non-sequential examples (x_{t,i}, y_{t,i}) from S, and use it to train a classifier g that maps a single feature-vector x to a label y. One can then use g to classify each instance x_i in the vector x = ⟨x_1, …, x_n⟩ separately, ignoring its sequential position, and append the resulting predictions y_i into an output vector y.
  2. We used the implementations of ME, MEMMs, and CRFs provided by Minorthird [Minorthird, 2004], with a limit of 50 optimization iterations. This limit does not substantially change the results.

References

William W. Cohen, Weiwei Sun, and Vitor R. Carvalho. (2005). “Stacked Sequential Learning.”