2010 ConfidenceinStructuredPrediction


Subject Headings: Sequence Chunk Confidence Estimation, Confidence Estimation.

Notes

Cited By

Quotes

Abstract

Confidence-Weighted linear classifiers (CW) and their successors were shown to perform well on binary and multiclass NLP problems. In this paper we extend the CW approach to sequence learning and show that it achieves state-of-the-art performance on four noun phrase chunking and named entity recognition tasks. We then derive a few algorithmic approaches to estimate the correctness of each label's prediction in the output sequence. We show that our approach provides reliable relative correctness information, as it outperforms other alternatives in ranking label predictions according to their error. We also show empirically that our methods output close-to-absolute estimates of error. Finally, we show how to use this information to improve active learning.

1 Introduction

In the past decade structured classification has seen much interest from the machine learning community. After the introduction of conditional random fields (CRFs) (Lafferty et al., 2001) and maximum margin Markov networks (Taskar et al., 2003), which are batch algorithms, new online methods were introduced. For example, the passive-aggressive algorithm was adapted to chunking (Shimizu and Haas, 2006), parsing (McDonald et al., 2005b), learning preferences (Wick et al., 2009) and text segmentation (McDonald et al., 2005a). These new online algorithms are fast to train and simple to implement, yet they generate models that output merely a prediction with no additional information, as opposed to probabilistic models like CRFs or HMMs.

In this work we fill this gap by proposing a few alternatives to compute confidence in the output of discriminative non-probabilistic algorithms. As before, our algorithms output the highest-scoring labeling. However, they also compute additional labelings that are used to compute a per-word confidence in the predicted labels. We build on the recently introduced confidence-weighted learning (Dredze et al., 2008; Crammer et al., 2009b) and induce a distribution over labelings from the distribution maintained over weight-vectors.

We show how to compute confidence estimates in the label predicted per word, such that the confidence reflects the probability that the label is not correct. We then use this confidence information to rank all labeled words (in all sentences). This can be thought of as a retrieval of the erroneous words, which can then be passed to a human annotator for examination, either to correct these mistakes or as a quality control component. Next, we show how to apply our techniques to active learning over sequences. We evaluate our methods on four NP chunking and NER datasets and demonstrate the usefulness of our methods. Finally, we report the performance obtained by CW-like algorithms adapted to sequence prediction, which is comparable with current state-of-the-art algorithms.
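
To make the per-word confidence idea concrete, the following is a minimal sketch of one way to induce a distribution over labelings from the Gaussian maintained over weight vectors: draw several weight vectors, decode the sentence with each, and score every word by how often the sampled labelings agree with the mean model's labeling. The helper names (`viterbi_decode`, `feature_map`) and the number of draws are assumptions for illustration; the paper's own estimators (e.g. the KD variants referenced later) may differ in their details.

```python
import numpy as np

def per_word_confidence(x, mu, sigma_diag, viterbi_decode, feature_map, n_draws=30, rng=None):
    """Per-word confidence by sampling weight vectors from N(mu, diag(sigma_diag)) (sketch).

    Each draw is decoded with the supplied `viterbi_decode(x, w, feature_map)` helper,
    and every position is scored by how often the sampled labelings agree with the
    labeling produced by the mean weight vector.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    base = viterbi_decode(x, mu, feature_map)      # labeling of the mean model
    agree = np.zeros(len(base))
    for _ in range(n_draws):
        w = rng.normal(mu, np.sqrt(sigma_diag))    # one weight vector drawn from the Gaussian
        y = viterbi_decode(x, w, feature_map)
        agree += np.array([yp == bp for yp, bp in zip(y, base)], dtype=float)
    return base, agree / n_draws                   # per-word confidence estimates in [0, 1]
```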

2 Confidence-Weighted Learning

Consider the following online binary classification problem that proceeds in rounds. On the [math]\displaystyle{ i }[/math]th round the online algorithm receives an input [math]\displaystyle{ x_i \in \mathbb{R}^d }[/math] and applies its current rule to make a prediction [math]\displaystyle{ \hat{y}_i \in \mathcal{Y} }[/math], for the binary set [math]\displaystyle{ \mathcal{Y} = \{-1, +1\} }[/math]. It then receives the correct label [math]\displaystyle{ y_i \in \mathcal{Y} }[/math] and suffers a loss [math]\displaystyle{ \ell(y_i, \hat{y}_i) }[/math]. At this point, the algorithm updates its prediction rule with the pair [math]\displaystyle{ (x_i, y_i) }[/math] and proceeds to the next round. A summary of online algorithms can be found in (Cesa-Bianchi and Lugosi, 2006).
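
A minimal sketch of this round-based protocol, assuming a placeholder `model` object with `predict` and `update` methods and a `stream` of labeled pairs:

```python
def online_binary_learning(stream, model):
    """The round-based protocol described above: predict, observe the label, suffer a loss, update."""
    mistakes = 0
    for x_i, y_i in stream:             # round i: receive the input x_i
        y_hat = model.predict(x_i)      # predict a label in {-1, +1} with the current rule
        if y_hat != y_i:                # zero-one loss for this round
            mistakes += 1
        model.update(x_i, y_i)          # update the prediction rule with the labeled pair
    return mistakes
```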

Online confidence-weighted (CW) learning (Dredze et al., 2008; Crammer et al., 2008) generalized the passive-aggressive (PA) update principle to multivariate Gaussian distributions over the weight vectors, [math]\displaystyle{ \mathcal{N}(\mu, \Sigma) }[/math], for binary classification. The mean [math]\displaystyle{ \mu \in \mathbb{R}^d }[/math] contains the current estimate of the best weight vector, whereas the Gaussian covariance matrix [math]\displaystyle{ \Sigma \in \mathbb{R}^{d \times d} }[/math] captures the confidence in this estimate. More precisely, the diagonal elements [math]\displaystyle{ \Sigma_{p,p} }[/math] capture the confidence in the value of the corresponding weight [math]\displaystyle{ \mu_p }[/math]; the smaller the value of [math]\displaystyle{ \Sigma_{p,p} }[/math] is, the more confident the model is in the value of [math]\displaystyle{ \mu_p }[/math]. The off-diagonal elements [math]\displaystyle{ \Sigma_{p,q} }[/math] for [math]\displaystyle{ p \neq q }[/math] capture the correlation between the values of [math]\displaystyle{ \mu_p }[/math] and [math]\displaystyle{ \mu_q }[/math]. When the data is of large dimension, such as in natural language processing, a model that maintains a full covariance matrix is not feasible and we back off to diagonal covariance matrices.

CW classifiers are trained according to a PA rule that is modified to track differences in Gaussian distributions. At each round, the new mean and covariance of the weight-vector distribution are chosen to be the solution of an optimization problem (see (Crammer et al., 2008) for details). This particular CW rule may over-fit by construction. A more recent alternative scheme called AROW (adaptive regularization of weight-vectors) (Crammer et al., 2009b) replaces the guaranteed prediction at each round with a more relaxed objective (see (Crammer et al., 2009b)). AROW has been shown to perform well in practice, especially for noisy data where CW severely overfits.
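
For concreteness, here is a sketch of the binary AROW update with a diagonal covariance, following the closed-form rule of Crammer et al. (2009b). The regularization constant `r` and the choice to update only when the hinge loss is positive are standard, but treat the snippet as an illustration rather than the exact formulation used in this paper.

```python
import numpy as np

def arow_update(mu, sigma, x, y, r=1.0):
    """One binary AROW update with a diagonal covariance (sketch).

    mu, sigma: current mean and per-coordinate variance of the weight-vector distribution
    x: feature vector, y: label in {-1, +1}, r: regularization constant (an assumed default)
    """
    margin = y * np.dot(mu, x)
    if margin >= 1.0:                         # no hinge loss: keep the distribution unchanged
        return mu, sigma
    v = np.dot(sigma * x, x)                  # x^T Sigma x, the variance of the score
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta             # hinge loss scaled by beta
    mu = mu + alpha * y * (sigma * x)         # move the mean to better classify (x, y)
    sigma = sigma - beta * (sigma * x) ** 2   # shrink variance on the observed features
    return mu, sigma
```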

3 Sequence Labeling

In the sequence labeling setting, instances [math]\displaystyle{ x }[/math] belong to a general input space [math]\displaystyle{ \mathcal{X} }[/math] and conceptually are composed of a finite number [math]\displaystyle{ n }[/math] of components, such as the words of a sentence. The number of components [math]\displaystyle{ n = |x| }[/math] varies between instances. Each part of an instance is labeled from a finite set [math]\displaystyle{ \mathcal{Y} }[/math], [math]\displaystyle{ |\mathcal{Y}| = K }[/math]. That is, a labeling of an entire instance belongs to the product set [math]\displaystyle{ y \in \mathcal{Y} \times \mathcal{Y} \times \cdots \times \mathcal{Y} }[/math] ([math]\displaystyle{ n }[/math] times).

We employ a general approach (Collins, 2002; Crammer et al., 2009a) to generalize binary classification and use a joint feature mapping of an instance [math]\displaystyle{ x }[/math] and a labeling [math]\displaystyle{ y }[/math] into a common vector space, [math]\displaystyle{ \Phi(x, y) \in \mathbb{R}^d }[/math].

Given an input instance [math]\displaystyle{ x }[/math] and a model [math]\displaystyle{ \mu \in \mathbb{R}^d }[/math], we predict the labeling with the highest score, [math]\displaystyle{ \hat{y} = \arg\max_z \mu \cdot \Phi(x, z) }[/math]. A brute-force approach evaluates the score [math]\displaystyle{ \mu \cdot \Phi(x, z) }[/math] for each possible labeling [math]\displaystyle{ z \in \mathcal{Y}^n }[/math], which is not feasible for large values of [math]\displaystyle{ n }[/math]. Instead, we follow standard factorization and restrict the joint mapping to be of the form [math]\displaystyle{ \Phi(x, y) = \sum^{n}_{p=1} \Phi(x, y_p) + \sum^{n}_{q=2} \Phi(x, y_q, y_{q-1}) }[/math]. That is, the mapping is a sum of mappings, each taking into consideration only the label of a single part, or of two consecutive parts. The time required to compute the max operator is linear in [math]\displaystyle{ n }[/math] and quadratic in [math]\displaystyle{ K }[/math] using the dynamic-programming Viterbi algorithm.
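
The max can therefore be computed with first-order Viterbi decoding. Below is a minimal sketch, assuming the factorized scores have already been collected into a per-position emission matrix and a label-pair transition matrix (both derived from [math]\displaystyle{ \mu \cdot \Phi }[/math]); the array layout is an assumption for illustration.

```python
import numpy as np

def viterbi(emission, transition):
    """First-order Viterbi decoding (sketch).

    emission:   (n, K) array, emission[p, k]   = mu . Phi(x, y_p = k)
    transition: (K, K) array, transition[j, k] = mu . Phi(x, y_q = k, y_{q-1} = j)
    Returns the highest-scoring label sequence in O(n * K^2) time.
    """
    n, K = emission.shape
    score = np.zeros((n, K))            # best score of a prefix ending with label k at position p
    back = np.zeros((n, K), dtype=int)  # back-pointers to the best previous label
    score[0] = emission[0]
    for p in range(1, n):
        cand = score[p - 1][:, None] + transition + emission[p][None, :]  # (prev label, next label)
        back[p] = cand.argmax(axis=0)
        score[p] = cand.max(axis=0)
    labels = [int(score[-1].argmax())]  # best final label
    for p in range(n - 1, 0, -1):       # follow the back-pointers
        labels.append(int(back[p, labels[-1]]))
    return labels[::-1]
```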

After the algorithm makes a prediction, it uses the current labeled instance [math]\displaystyle{ (x_i, y_i) }[/math] to update the model. We now define the update rule both for a version of CW and for AROW for structured learning, starting with CW.
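
The exact structured CW and AROW rules are given in the paper and are not reproduced here. As a rough, hedged illustration of the general recipe, a binary-style AROW update can be applied to the feature difference between the correct labeling and the predicted labeling; the snippet below is a sketch under that assumption, not the paper's derivation.

```python
import numpy as np

def structured_update(mu, sigma, phi_gold, phi_pred, r=1.0):
    """Structured AROW-style update on the feature difference vector (sketch).

    phi_gold: joint feature vector Phi(x_i, y_i) of the correct labeling
    phi_pred: joint feature vector Phi(x_i, y_hat) of the predicted labeling
    """
    delta = phi_gold - phi_pred                  # direction that favors the gold labeling
    margin = np.dot(mu, delta)                   # score gap between gold and prediction
    if margin >= 1.0:                            # gold already wins by a margin: no update
        return mu, sigma
    v = np.dot(sigma * delta, delta)             # delta^T Sigma delta
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta
    mu = mu + alpha * (sigma * delta)            # pull the mean toward the gold labeling
    sigma = sigma - beta * (sigma * delta) ** 2  # reduce variance on the touched features
    return mu, sigma
```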

6 Active Learning

Encouraged by the success of the KD-PC and KD-Fixed algorithms in estimating the confidence of predictions, we apply these methods to the task of active learning. In active learning, the algorithm is given a large set of unlabeled data and a small set of labeled data, and it works in iterations. On each iteration, all of the labeled data available at that point is used to build a model, which is then used to choose a new subset of examples to be annotated.
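
A minimal sketch of the selection step of such an iteration, assuming sentence-level uncertainty is summarized by its least-confident word; the aggregation rule, the `confidence_fn` interface, and the batch size are assumptions for illustration and not necessarily the criterion used in the paper.

```python
def select_for_annotation(unlabeled, confidence_fn, batch_size=50):
    """Pick the sentences whose least-confident word has the lowest confidence (sketch).

    unlabeled:     iterable of unlabeled sentences
    confidence_fn: maps a sentence to a list of per-word confidence scores in [0, 1]
    """
    scored = [(min(confidence_fn(x)), x) for x in unlabeled]
    scored.sort(key=lambda pair: pair[0])        # least confident sentences first
    return [x for _, x in scored[:batch_size]]
```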

Related Work

Most previous work has focused on confidence estimation for an entire example or some fields of an entry (Culotta and McCallum, 2004) using CRFs. (Kristjansson et al., 2004) show the utility of confidence estimation in extracted fields of an interactive information extraction system by highlighting low-confidence fields for the user. (Scheffer et al., 2001) estimate the confidence of a single token label in an HMM-based information extraction system, using a method similar to the Delta method we used. (Ueffing and Ney, 2007) propose several methods for word-level confidence estimation for the task of machine translation. One of the methods they use is very similar to the weighted and non-weighted K-best Viterbi methods we used with the proper adjustments to t

References

Koby Crammer, Avihai Mejer. (2010). "Confidence in Structured-prediction Using Confidence-weighted Models."