2004 ConfidenceEstimationforInformat

(Culotta & McCallum, 2004) ⇒ Aron Culotta, and Andrew McCallum. (2004). “Confidence Estimation for Information Extraction.” In: Proceedings of HLT-NAACL 2004: Short Papers.

Subject Headings: Prediction Confidence Estimate, Chunking Model.

Notes

Cited By

Quotes

Abstract

Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the confidence of both extracted fields and entire multi-field records, obtaining an average precision of 98% for retrieving correct fields and 87% for multi-field records

1 Introduction

Information extraction usually consists of tagging a sequence of words (e.g. a Web document) with semantic labels (e.g. PERSON NAME, PHONE NUMBER) and depositing these extracted fields into a database. Because automated information extraction will never be perfectly accurate, it is helpful to have an effective measure of the confidence that the proposed database entries are correct. There are at least three important applications of accurate confidence estimation. First, accuracy-coverage trade-offs are a common way to improve data integrity in databases. Efficiently making these trade-offs requires an accurate prediction of correctness.

Second, confidence estimates are essential for interactive information extraction, in which users may correct incorrectly extracted fields. These corrections are then automatically propagated in order to correct other mistakes in the same record. Directing the user to the least confident field allows the system to improve its performance with a minimal amount of user effort. Kristjannson et al. (2004) show that using accurate confidence estimation reduces error rates by 46%.

Third, confidence estimates can improve performance of data mining algorithms that depend upon databases created by information extraction systems (McCallum and Jensen, 2003). Confidence estimates provide data mining applications with a richer set of “bottom-up” hypotheses, resulting in more accurate inferences. An example of this occurs in the task of citation co-reference resolution. An information extraction system labels each field of a paper citation (e.g. AUTHOR, TITLE), and then co-reference resolution merges disparate references to the same paper. Attaching a confidence value to each field allows the system to examine alternate labelings for less confident fields to improve performance.

Sound probabilistic extraction models are most conducive to accurate confidence estimation because of their intelligent handling of uncertainty information. In this work we use conditional random fields (Lafferty et al., 2001), a type of undirected graphical model, to automatically label fields of contact records. Here, a record is an entire block of a person’s contact information, and a field is one element of that record (e.g. COMPANY NAME). We implement several techniques to estimate both field confidence and record confidence, obtaining an average precision of 98% for fields and 87% for records

…

3 Field Confidence Estimation

The Viterbi algorithm finds the most likely state sequence matching the observed word sequence. The word that Viterbi matches with a particular FSM state is extracted as belonging to the corresponding database field. We can obtain a numeric score for an entire sequence, and then turn this into a probability for the entire sequence by normalizing. However, to estimate the confidence of an individual field, we desire the probability of a subsequence, marginalizing out the state selection for all other parts of the sequence. A specialization of Forward-Backward, termed Constrained Forward-Backward (CFB), returns exactly this probability.

Because CRFs are conditional models, Viterbi finds the most likely state sequence given an observation sequence, defined as [math]\displaystyle{ s^* = \operatorname{argmax}_S p_\Lambda (\bf{s}|\bf{o}) }[/math]. To avoid an exponential-time search over all possible settings of [math]\displaystyle{ s }[/math], Viterbi stores the probability of the most likely path at time [math]\displaystyle{ t }[/math] that accounts for the first [math]\displaystyle{ t }[/math] observations and ends in state [math]\displaystyle{ s_i }[/math]. Following traditional notation, we define this probability to be [math]\displaystyle{ \delta_t(s_i) }[/math], where [math]\displaystyle{ \delta_0(s_i) }[/math] is the probability of starting in each state [math]\displaystyle{ s_i }[/math], and the recursive formula is:

[math]\displaystyle{ \delta_{t+1}(s_i) = \underset{s'}{\operatorname{max}} \biggl[ \delta_t(s') \biggl( \sum_k \lambda_k f_k(s',s_i,o,t) \biggr) \biggr] }[/math] (2)

terminating in [math]\displaystyle{ s^* = \underset{s_1 \le s_i \le s_N}{\operatorname{argmax}} [\delta_T(S_i)] }[/math].

The Forward-Backward algorithm can be viewed as a generalization of the Viterbi algorithm: instead of choosing the optimal state sequence, Forward-Backward evaluates all possible state sequences given the observation sequence. The “forward values” [math]\displaystyle{ \alpha_{t+1}(s_i) }[/math] are recursively defined similarly as in Eq. 2, except the max is replaced by a summation. Thus we have

[math]\displaystyle{ \alpha_{t+1}(s_i) = \sum_{s'} \bigl[ \alpha_t(s') \operatorname{exp} \bigl( \sum_k \lambda_k f_k(s',s_i,\bf{o},t)\bigr) \bigr] }[/math]. (3)

terminating in [math]\displaystyle{ Z_o = \sum_i \alpha_T(s_i) }[/math] from Eq. 1.

To estimate the probability that a field is extracted correctly, we constrain the Forward-Backward algorithm such that each path conforms to some subpath of constraints [math]\displaystyle{ C = \lt s_q … s_r\gt }[/math] from time step [math]\displaystyle{ q }[/math] to [math]\displaystyle{ r }[/math]. Here, [math]\displaystyle{ s_q \in C }[/math] can be either a positive constraint (the sequence must pass through [math]\displaystyle{ s_q }[/math]) or a negative constraint (the sequence must not pass through [math]\displaystyle{ s_q }[/math]).

In the context of information extraction, C corresponds to an extracted field. The positive constraints specify the observation tokens labeled inside the field, and the negative constraints specify the field boundary. For example, if we use states names B-TITLE and I-JOBTITLE to label tokens that begin and continue a JOBTITLE field, and the system labels observation sequence [math]\displaystyle{ \lt o_2,...,o_5\gt }[/math] as a JOBTITLE field, then [math]\displaystyle{ C = s_2 = \operatorname{B-JOBTITLE}, s_3 =...= s_5 = \operatorname{I-JOBTITLE}, s_6 \ne \operatorname{I-JOBTITLE}\gt }[/math].

The calculations of the forward values can be made to conform to [math]\displaystyle{ C }[/math] by the recursion [math]\displaystyle{ \alpha'_q(s_i) = q(si) = }[/math]

[math]\displaystyle{ \begin{cases} \sum_{s'} [ \alpha'_{q-1}(s')\operatorname{exp}(), & \mbox{if s_i} \simeq s_q \\ 0, & \mbox{if }n\mbox{otherwise} \end{cases} }[/math]

[math]\displaystyle{ \begin{cases} \sum_{s'} [ \alpha'_{q-1}(s')\operatorname{exp}(\sum_k \lambda\k f\k (s', s_i, \bf{o} t))] & \mbox{if} s_i \simeq s_q \\ 0 & \mbox{if }n\mbox{otherwise} \end{cases} }[/math]

for all [math]\displaystyle{ s_q \in C }[/math], where the operator si ' sq means si conforms to constraint sq. For time steps not constrained by C, Eq. 3 is used instead.

If [math]\displaystyle{ \alpha'_{t+1}(s_i) }[/math] is the constrained forward value, then [math]\displaystyle{ Z_0 }[/math] o = P i �0 T (si) is the value of the constrained lattice, the set of all paths that conform to [math]\displaystyle{ C }[/math]. Our confidence estimate is obtained by normalizing [math]\displaystyle{ Z'_0 }[/math] using [math]\displaystyle{ Z_o }[/math], i.e. [math]\displaystyle{ Z'_0 - Z_o }[/math].

We also implement an alternative method that uses the state probability distributions for each state in the extracted field. Let [math]\displaystyle{ \gamma_t(s_i) = p(s_i|o_1,...,o_T) }[/math] be the probability of being in state [math]\displaystyle{ i }[/math] at time [math]\displaystyle{ t }[/math] given the observation sequence. We define the confidence measure GAMMA to be [math]\displaystyle{ \prod_{i=u}^v \gamma_i(s_i) }[/math], where [math]\displaystyle{ u }[/math] and [math]\displaystyle{ v }[/math] are the start and end indices of the extracted field.

4 Record Confidence Estimation

We can similarly use CFB to estimate the probability that an entire record is labeled correctly. The procedure is the same as in the previous section, except that [math]\displaystyle{ C }[/math] now specifies the labels for all fields in the record. We also implement three alternative record confidence estimates. FIELDPRODUCT calculates the confidence of each field in the record using CFB, then multiplies these values together to obtain the record confidence. FIELDMIN instead uses the minimum field confidence as the record confidence. VITERBIRATIO uses the ratio of the probabilities of the top two Viterbi paths, capturing how much more likely [math]\displaystyle{ s }[/math]� is than its closest alternative.

5 Reranking with Maximum Entropy

We also trained two conditional maximum entropy classifiers to classify fields and records as being labeled correctly or incorrectly. The resulting posterior probability of the “correct” label is used as the confidence measure. The approach is inspired by results from (Collins, 2000), which show discriminative classifiers can improve the ranking of parses produced by a generative parser. After initial experimentation, the most informative inputs for the field confidence classifier were field length, the predicted label of the field, whether or not this field has been extracted elsewhere in this record, and the CFB confidence estimate for this field. For the record confidence classifier, we incorporated the following features: record length, whether or not two fields were tagged with the same label, and the CFB confidence estimate.

6 Experiments

2187 contact records (27,560 words) were collected from Web pages and email and 25 classes of data fields were hand-labeled.^[1] The features for the CRF consist of the token text, capitalization features, 24 regular expressions over the token text (e.g. CONTAINSHYPHEN), and offsets of these features within a window of size 5. We also use 19 lexicons, including “US Last Names,” “US First Names,” and “State Names.” Feature induction is not used in these experiments. The CRF is trained on 60% of the data, and the remaining 40% is split evenly into development and testing sets. The development set is used to train the maximum entropy classifiers, and the testing set is used to measure the accuracy of the confidence estimates. The CRF achieves an overall token accuracy of 87.32 on the testing data, with a field-level performance of F1 = 84.11, precision = 85.43, and recall = 82.83.

To evaluate confidence estimation, we use three methods. The first is Pearson’s r, a correlation coefficient ranging from -1 to 1 that measures the correlation between a confidence score and whether or not the field (or record) is correctly labeled. The second is average precision, used in the Information Retrieval community to evaluate ranked lists. It calculates the precision at each point in the ranked list where a relevant document is found and then averages these values. Instead of ranking documents by their relevance score, here we rank fields (and records) by their confidence score, where a correctly labeled field is analogous to a relevant document. WORSTCASE is the average precision obtained by ranking all incorrect instances above all correct instances. Tables 1 and 2 show that CFB and MAXENT are statistically similar, and that both outperform competing methods. Note that WORSTCASE achieves a high average precision simply because so many fields are correctly labeled. In all experiments, RANDOM assigns confidence values chosen uniformly at random between 0 and 1.

Pearson’s r Avg. Prec
CFB .573 .976
MaxEnt .571 .976
Gamma .418 .912
Random .012 .858
WorstCase – .672

Table 1: Evaluation of confidence estimates for field confidence.

CFB and MAXENT outperform competing methods.
Pearson’s r Avg. Prec
CFB .626 .863
MaxEnt .630 .867
FieldProduct .608 .858
FieldMin .588 .843
ViterbiRatio .313 .842
Random .043 .526
WorstCase – .304

Table 2: Evaluation of confidence estimates for record confidence. CFB, MAXENT again perform best.

The third measure is an accuracy-coverage graph. Better confidence estimates push the curve to the upper-right. Figure 1 shows that CFB andMAXENT dramatically outperform GAMMA. Although omitted for space, similar results are also achieved on a noun-phrase chunking task (CFB r = .516, GAMMA r = .432) and a named-entity extraction task (CFB r = .508, GAMMA r = .480).

}}

References

,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2004 ConfidenceEstimationforInformat	Aron Culotta Andrew McCallum			Confidence Estimation for Information Extraction						2004

↑ The 25 fields are: FirstName, MiddleName, LastName, NickName, Suffix, Title, JobTitle, CompanyName, Department, AddressLine, City1, City2, State, Country, PostalCode, HomePhone, Fax, CompanyPhone, DirectCompanyPhone, Mobile, Pager, VoiceMail, URL, Email, InstantMessage

[1] The 25 fields are: FirstName, MiddleName, LastName, NickName, Suffix, Title, JobTitle, CompanyName, Department, AddressLine, City1, City2, State, Country, PostalCode, HomePhone, Fax, CompanyPhone, DirectCompanyPhone, Mobile, Pager, VoiceMail, URL, Email, InstantMessage

[1]

2004 ConfidenceEstimationforInformat

Notes

Cited By

2010

2007

2005