2008 RedundancyInWebScaleIE

From GM-RKB

Subject Headings: KnowItAll System, KnowItAll Hypothesis, Information Extraction from Text Task.

Notes

Quotes

Abstract

Information Extraction (IE) is the task of automatically extracting knowledge from text. The massive body of text now available on the World Wide Web presents an unprecedented opportunity for IE. IE systems promise to encode vast quantities of Web content into machine-processable knowledge bases, presenting a new approach to a fundamental challenge for artificial intelligence: the automatic acquisition of massive bodies of knowledge. Such knowledge bases would dramatically extend the capabilities of Web applications. Future Web search engines, for example, could query the knowledge bases to answer complicated questions that require synthesizing information across multiple Web pages.

However, IE on the Web is challenging due to the enormous variety of distinct concepts expressed. All extraction techniques make errors, and the standard error-detection strategy used in previous, small-corpus extraction systems — hand-labeling examples of each concept to be extracted, then training a classifier using the labeled examples — is intractable on the Web. How can we automatically identify correct extractions for arbitrary target concepts, without hand-labeled examples?

This thesis shows how IE on the Web is made possible through the KnowItAll hypothesis, which states that extractions that occur more frequently in distinct sentences in a corpus are more likely to be correct. The KnowItAll hypothesis holds on the Web, and can be used to identify many correct extractions because the Web is highly redundant: individual facts are often repeated many times, and in many different ways. In this thesis, we show that a probabilistic model of the KnowItAll hypothesis, coupled with the redundancy of the Web, can power effective IE for arbitrary target concepts without hand-labeled data. In experiments with IE on the Web, we show that the probabilities produced by our model are 15 times better, on average, when compared with techniques from previous work. We also prove formally that under the assumptions of the model, “Probably Approximately Correct” IE can be attained from only unlabeled data.
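The core intuition of the Urns model behind this claim can be illustrated with a small Bayesian sketch. This is our simplified illustration, not the thesis's actual model: the full Urns model uses Zipfian frequency distributions over labels, whereas here every correct label is assumed to share one per-draw extraction probability and every error label another; the function name and parameters are ours.

```python
import math

def urns_posterior(k, n, num_c, num_e, p_c, p_e):
    """P(extraction is correct | it appeared k times in n draws).

    Simplified single-urn sketch: each of num_c correct labels is drawn
    with probability p_c per draw, each of num_e error labels with
    probability p_e, where p_c > p_e (correct facts repeat more often).
    """
    # Binomial likelihood of seeing the label k times in n draws
    like_c = math.comb(n, k) * p_c**k * (1 - p_c)**(n - k)
    like_e = math.comb(n, k) * p_e**k * (1 - p_e)**(n - k)
    # Prior: extraction drawn uniformly from the combined label sets
    prior_c = num_c / (num_c + num_e)
    prior_e = num_e / (num_c + num_e)
    return prior_c * like_c / (prior_c * like_c + prior_e * like_e)

# An extraction seen 5 times in 10,000 draws is far more credible
# than one seen once, even though errors outnumber targets 9 to 1.
p5 = urns_posterior(5, 10000, num_c=1000, num_e=9000, p_c=0.001, p_e=0.0001)
p1 = urns_posterior(1, 10000, num_c=1000, num_e=9000, p_c=0.001, p_e=0.0001)
```

Under these assumed parameters the posterior rises sharply with repetition count, which is exactly the KnowItAll hypothesis expressed probabilistically: frequency in distinct sentences is evidence of correctness.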

…

3.8 Related Work

In contrast to the bulk of previous IE work, our focus is on unsupervised IE (UIE), where Urns substantially outperforms previous methods (Figure 3.2).

In addition to the noisy-or models we compare against in our experiments, the IE literature contains a variety of heuristics using repetition as an indication of the veracity of extracted information. For example, Riloff and Jones [50] rank extractions by the number of distinct patterns generating them, plus a factor for the reliability of the patterns. Our work is intended to formalize these heuristic techniques, and unlike the noisy-or models, we explicitly model the distribution of the target and error sets (our num(C) and num(E)), which is shown to be important for good performance in Section 3.4.1. The accuracy of the probability estimates produced by the heuristic and noisy-or methods is rarely evaluated explicitly in the IE literature, although most systems make implicit use of such estimates. For example, bootstrap-learning systems start with a set of seed instances of a given relation, which are used to identify extraction patterns for the relation; these patterns are in turn used to extract further instances (e.g. [50, 35, 3]). As this process iterates, random extraction errors result in overly general extraction patterns, leading the system to extract further erroneous instances. The more accurate estimates of extraction probabilities produced by Urns would help prevent this “concept drift.”
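The noisy-or baseline mentioned above is simple to state: an extraction produced by several patterns is correct unless every one of those patterns erred independently. A minimal sketch (function name ours):

```python
def noisy_or(precisions):
    """Noisy-or combination: given the precisions of the patterns that
    produced an extraction, return the probability the extraction is
    correct, assuming each pattern errs independently."""
    prob_all_wrong = 1.0
    for p in precisions:
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong

# Two patterns of precision 0.5 yield 1 - 0.5 * 0.5 = 0.75
noisy_or([0.5, 0.5])
```

Note what the formula omits: the total sample size n and the shapes of the target and error sets (num(C) and num(E)). Those are precisely the quantities Urns models explicitly, which Section 3.4.1 shows to be important for good performance.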

Skounakis and Craven [55] develop a probabilistic model for combining evidence from multiple extractions in a supervised setting. Their problem formulation differs from ours, as they classify each occurrence of an extraction, and then use a binomial model along with the false positive and true positive rates of the classifier to obtain the probability that at least one occurrence is a true positive. Similar to the above approaches, they do not explicitly account for sample size [math]\displaystyle{ n }[/math], nor do they model the distribution of target and error extractions.
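One plausible reading of that "at least one occurrence is a true positive" computation can be sketched as follows. This is our illustrative simplification, not Skounakis and Craven's exact formulation: we assume each positive classification is independently genuine with the same Bayes-derived probability, and the per-occurrence prior is a parameter we introduce.

```python
def prob_some_true_positive(k, tp_rate, fp_rate, prior):
    """Probability that at least one of k positive classifications is a
    true positive, given the classifier's true-positive and
    false-positive rates and a per-occurrence prior that a mention is
    genuine. Assumes (our simplification) independent occurrences."""
    # Bayes: P(mention genuine | classified positive)
    q = prior * tp_rate / (prior * tp_rate + (1 - prior) * fp_rate)
    # At least one genuine = 1 - P(all k positives are false positives)
    return 1.0 - (1.0 - q)**k

# With prior=0.5, tp_rate=0.8, fp_rate=0.1: q = 8/9, so even a single
# positive classification is strong evidence, and three nearly certain.
p1 = prob_some_true_positive(1, tp_rate=0.8, fp_rate=0.1, prior=0.5)
p3 = prob_some_true_positive(3, tp_rate=0.8, fp_rate=0.1, prior=0.5)
```

As the surrounding text notes, nothing in this computation depends on the total sample size or on how target and error extractions are distributed, which is the gap Urns fills.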

Culotta and McCallum [16] provide a model for assessing the confidence of extracted information using conditional random fields (CRFs). Their work focuses on assigning accurate confidence values to individual occurrences of an extracted field based on textual features. This is complementary to our focus on combining confidence estimates from multiple occurrences of the same extraction. In fact, each possible feature vector processed by the CRF in [16] can be thought of as a virtual urn m in our Urns. The confidence output of Culotta and McCallum’s model could then be used to provide the precision pm for the urn.



Doug Downey (2008). “Redundancy in Web-scale Information Extraction: Probabilistic Model and Experimental Results.” http://turing.cs.washington.edu/papers/ddowney thesis.pdf