2009 OpenInformationExtractionForTheWeb

From GM-RKB

Subject Headings: Open Information Extraction Task, Web-based Information Extraction.

Notes

Quotes

Abstract

The World Wide Web contains a significant amount of information expressed using natural language. While unstructured text is often difficult for machines to understand, the field of Information Extraction (IE) offers a way to map textual content into a structured knowledge base. The ability to amass vast quantities of information from Web pages has the potential to increase the power with which a modern search engine can answer complex queries.

IE has traditionally focused on acquiring knowledge about particular relationships within a small collection of domain-specific text. Typically, a target relation is provided to the system as input along with extraction patterns or examples that have been specified by hand. Shifting to a new relation requires a person to create new patterns or examples. This manual labor scales linearly with the number of relations of interest.

The task of extracting information from the Web presents several challenges for existing IE systems. The Web is large and heterogeneous; the number of potentially interesting relations is massive and their identity often unknown. To enable large-scale knowledge acquisition from the Web, this thesis presents Open Information Extraction, a novel extraction paradigm that automatically discovers thousands of relations from unstructured text and readily scales to the size and diversity of the Web.

1.2.1 Weakly-Supervised Systems

Brin [11], Agichtein and Gravano [1], Riloff and Jones [70], Pasca et al. [62], and Bunescu and Mooney [12] sought to reduce the amount of manual labor necessary to perform relation-specific extraction. Rather than demand hand-tagged corpora, these weakly-supervised IE systems required a user to specify relation-specific knowledge in the form of a small set of seed instances known to satisfy the relation of interest. For instance, by specifying the pairs (Microsoft, Redmond), (Exxon, Irving) and (Intel, Santa Clara), these IE systems learned patterns (e.g. "<X>’s headquarters in <Y>" and "<Y>-based <X>") that identified additional pairs of company names and locations satisfying the Headquarters(X, Y) relation. While these systems significantly reduced the amount of labeled input required, and can achieve levels of precision and recall on par with fully-supervised systems, the remaining labeling effort becomes non-trivial when the goal is to extract instances of thousands of relations.
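The seed-to-pattern step described above can be sketched roughly as follows. This is a minimal illustration, not the method of any of the cited systems: the toy sentences, the `ENTITY` regex standing in for entity recognition, and the function names `learn_patterns` and `apply_patterns` are all assumptions made for the example.

```python
import re

# Crude stand-in for entity recognition: one or more capitalized words (assumption).
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def learn_patterns(sentences, seeds):
    """Collect (first_slot, middle_text, second_slot) triples from sentences
    that mention both members of a seed pair."""
    patterns = set()
    for sentence in sentences:
        for x, y in seeds:
            i, j = sentence.find(x), sentence.find(y)
            if i < 0 or j < 0:
                continue  # this seed pair does not occur in the sentence
            if i < j:
                patterns.add(("X", sentence[i + len(x):j], "Y"))
            else:
                patterns.add(("Y", sentence[j + len(y):i], "X"))
    return patterns

def apply_patterns(sentences, patterns):
    """Match every learned pattern against every sentence and return (X, Y) pairs."""
    pairs = set()
    for first, middle, second in patterns:
        regex = re.compile(
            rf"(?P<{first}>{ENTITY}){re.escape(middle)}(?P<{second}>{ENTITY})"
        )
        for sentence in sentences:
            for match in regex.finditer(sentence):
                pairs.add((match.group("X"), match.group("Y")))
    return pairs

# Seed pairs from the text; the sentences are invented for illustration.
seeds = [("Microsoft", "Redmond"), ("Exxon", "Irving"), ("Intel", "Santa Clara")]
corpus = [
    "Microsoft's headquarters in Redmond were expanded.",
    "Irving-based Exxon reported earnings.",
    "Google's headquarters in Mountain View are large.",
    "Cupertino-based Apple released a phone.",
]

new_pairs = apply_patterns(corpus, learn_patterns(corpus, seeds)) - set(seeds)
print(sorted(new_pairs))
```

Note that the seed list is still supplied by hand for a single relation, which is the per-relation labeling cost the paragraph above refers to.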

1.2.2 Self-Supervised Systems

KnowItAll [33] is a state-of-the-art Web extraction system that addresses the automation challenge by learning to label its own training examples, and tackles issues pertaining to corpus heterogeneity by not relying on deep linguistic analysis or entity recognizers. Given a relation, KnowItAll used a set of domain-independent patterns to automatically instantiate relation-specific extraction rules. For example, KnowItAll utilized generic extraction patterns like “<X> is a <Y>” to find a list of candidate members X of the class Y. When this pattern was instantiated for the class Country, for instance, it matched the sentence “Spain is a southwestern European country located on the Iberian Peninsula” and output Country(Spain).
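As a rough sketch of how a generic pattern such as “<X> is a <Y>” might be instantiated for one class, consider the following. It is not KnowItAll's actual rule language: the regex, the limit of three modifier words, and the names `instantiate_rule` and `extract` are assumptions chosen for illustration.

```python
import re

def instantiate_rule(class_name):
    """Build a regex for the generic pattern '<X> is a <Y>' with Y fixed to class_name."""
    noun = re.escape(class_name.lower())
    # X is a run of capitalized words; up to three modifier words may sit
    # between "is a/an" and the class noun (both choices are assumptions).
    return re.compile(
        r"\b(?P<X>[A-Z][\w-]*(?: [A-Z][\w-]*)*) is an? (?:[\w-]+ ){0,3}?" + noun + r"\b"
    )

def extract(class_name, sentences):
    """Return candidate extractions of the form Class(Instance)."""
    rule = instantiate_rule(class_name)
    return [
        f"{class_name}({match.group('X')})"
        for sentence in sentences
        for match in rule.finditer(sentence)
    ]

sentences = [
    "Spain is a southwestern European country located on the Iberian Peninsula.",
    "Garth Brooks is a country singer.",
    "Paris is the capital of France.",
]

candidates = extract("Country", sentences)
print(candidates)
```

A rule this loose also proposes Country(Garth Brooks), so its output is only a list of candidates that still needs to be assessed.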

KnowItAll’s extraction patterns were applied to Web pages identified via search-engine queries. The resulting extractions were assigned a probability using information-theoretic measures derived from search engine hit counts, providing a method of identifying which instantiations were most likely to be bona-fide members of the class. For example, in order to estimate the likelihood that “China” is the name of a country, KnowItAll used automatically generated phrases associated with the class to check whether there was a high correlation between the number of documents containing the word “China” and those containing the phrase “countries such as.” Thus KnowItAll was able to confidently label China, France, and India as members of the class Country while correctly recognizing that “Garth Brooks is a country singer” does not provide sufficient evidence that “Garth Brooks” is the name of a country [30]. Finally, KnowItAll used a pattern-learning algorithm to acquire relation-specific extraction patterns (e.g. “capital of <country>”) that led it to extract additional countries. Inspired by KnowItAll, the URES Web IE system [71] also used high-quality output from baseline KnowItAll to successfully supervise the learning of relation-specific extraction patterns.
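The hit-count assessment can be illustrated with a small sketch. The excerpt only says the measures are information-theoretic and derived from hit counts, so the ratio used here (hits for the class phrase followed by the instance, divided by hits for the instance alone) is an assumption, as are the function names and the six invented documents that stand in for a search engine.

```python
def hits(phrase, documents):
    """Number of documents containing the phrase (stand-in for a search-engine hit count)."""
    return sum(phrase in doc for doc in documents)

def pmi_score(instance, discriminator, documents):
    """Hits(discriminator + instance) / Hits(instance); 0.0 if the instance never occurs."""
    instance_hits = hits(instance, documents)
    if instance_hits == 0:
        return 0.0
    return hits(f"{discriminator} {instance}", documents) / instance_hits

# Invented toy documents, for illustration only.
documents = [
    "countries such as China and India export tea",
    "China is a large country",
    "travel to countries such as China",
    "Garth Brooks is a country singer",
    "Garth Brooks released a new album",
    "tickets for Garth Brooks sold out",
]

scores = {
    name: pmi_score(name, "countries such as", documents)
    for name in ("China", "Garth Brooks")
}
print(scores)
```

On this toy data "China" co-occurs with the class phrase in two of its three documents, while "Garth Brooks" never does, so the two candidates are cleanly separated.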

KnowItAll and URES are self-supervised; instead of utilizing hand-tagged training data, each system selects and labels its own training examples and iteratively bootstraps its learning process. Self-supervised systems are a species of unsupervised systems because they require no hand-tagged training examples. However, unlike classical unsupervised systems, self-supervised systems do utilize labeled examples. Instead of relying on hand-tagged data, self-supervised systems autonomously “roll their own” labeled examples.

KnowItAll was the first published system to carry out unsupervised, domain-independent, large-scale extraction from Web pages. The first implementation of KnowItAll required large numbers of search engine queries and Web page downloads; as a result, experiments using KnowItAll often took weeks to complete. This issue was addressed in a subsequent implementation, KnowItNow [14]. Despite having made important progress in automating IE at Web scale, KnowItAll and KnowItNow are relation-specific: the set of relations has to be named by the human user in advance. This is a significant obstacle to open-ended extraction; while processing text one often encounters unanticipated concepts and relations. Furthermore, the extraction process is performed over the entire corpus each time a relation of interest is identified. In the remaining chapters we show how the Open IE paradigm retains KnowItAll’s benefits but eliminates its inefficiencies.



Author: Michele Banko
Title: Open Information Extraction for the Web
URL: http://turing.cs.washington.edu/papers/banko-thesis.pdf
Year: 2009