2005 UnsupNEExtrFromTheWeb

Subject Headings: Semi-Supervised Named Entity Recognition Algorithm, KnowItAll System, Information Extraction Task, Pointwise Mutual Information and Information Retrieval.


Cited By

  • ~344 …



  • (EtzioniBC, 2006) ⇒ Oren Etzioni, Michele Banko, and M. J. Cafarella. (2006). “Machine Reading.” In: Proceedings of AAAI-2006.
    • The KnowItAll Web IE system (Etzioni et al., 2005) took the next step in automation by learning to label its own training examples using only a small set of domain independent extraction patterns, thus being the first published system to carry out unsupervised, domain independent, large-scale extraction from Web pages. … When instantiated for a particular relation, these generic patterns yield relation-specific extraction rules that are then used to learn domain-specific extraction rules. The rules are applied to Web pages, identified via search-engine queries, and the resulting extractions are assigned a probability using mutual-information measures derived from search engine hit counts. For example, KnowItAll utilized generic extraction patterns like “<Class> such as <Mem>” to suggest instantiations of <Mem> as candidate members of the class. Next, KnowItAll used frequency information to identify which instantiations are most likely to be bona-fide members of the class. Thus, it was able to confidently label major cities including Seattle, Tel Aviv, and London as members of the class “Cities” (Downey, Etzioni, and Soderland 2005). Finally, KnowItAll learned a set of relation-specific extraction patterns. … KnowItAll is self-supervised: instead of utilizing hand-tagged training data, the system selects and labels its own training examples, and iteratively bootstraps its learning process. In general, self-supervised systems are a species of unsupervised systems because they require no hand-tagged training examples whatsoever. However, unlike classical unsupervised systems (e.g., clustering), self-supervised systems do utilize labeled examples and do form classifiers whose accuracy can be measured using standard metrics. Instead of relying on hand-tagged data, self-supervised systems autonomously “roll their own” labeled examples. … While self-supervised, KnowItAll is relation-specific: it requires a laborious bootstrapping process for each relation of interest, and the set of relations of interest has to be named by the human user in advance.
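The pattern-instantiation and hit-count scoring steps described in the quote above can be sketched roughly as follows. This is a minimal illustration, not KnowItAll's implementation: the function names `instantiate` and `pmi`, the capitalized-word regex used for `<Mem>`, and the choice of a simple hit-count ratio as the mutual-information measure are all assumptions made for this sketch.

```python
import re

# Crude stand-in for a proper-noun phrase: one or more capitalized words.
# (Assumption for this sketch; a real system would use a proper NP chunker.)
MEMBER = r"((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"


def instantiate(generic_pattern, class_label):
    """Turn a generic pattern such as '<Class> such as <Mem>' into a regex
    for one specific class label."""
    regex = re.escape(generic_pattern)
    regex = regex.replace(re.escape("<Class>"), re.escape(class_label))
    regex = regex.replace(re.escape("<Mem>"), MEMBER)
    return re.compile(regex)


def pmi(hits_with_discriminator, hits_instance_alone):
    """Hit-count ratio used here as a PMI-style score (assumed form):
    how often the candidate co-occurs with a discriminator phrase,
    relative to how often the candidate occurs at all."""
    if hits_instance_alone == 0:
        return 0.0
    return hits_with_discriminator / hits_instance_alone


rule = instantiate("<Class> such as <Mem>", "cities")
candidates = rule.findall("He has lived in cities such as Tel Aviv and London.")
```

Note that this simple rule captures only the first member of an enumeration ("Tel Aviv" here); handling the remaining conjuncts would need a richer pattern.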



The KNOWITALL system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KNOWITALL's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KNOWITALL extracted over 50,000 class instances, but suggested a challenge: How can we improve KNOWITALL's recall and extraction rate without sacrificing precision?

This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., "chemist" and "biologist" are identified as sub-classes of "scientist"). List Extraction locates lists of class instances, learns a "wrapper" for each list, and extracts elements of each list. Since each method bootstraps from KNOWITALL's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KNOWITALL a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.



5. List Extractor

We now present the third method for increasing KNOWITALL’s recall, the List Extractor (LE). Where the methods described earlier extract information from unstructured text on Web pages, LE uses regular page structure to support extraction. LE locates lists of items on Web pages, learns a wrapper on the fly for each list, automatically extracts items from these lists, then sorts the items by the number of lists in which they appear.
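The final ranking step, sorting items by the number of lists in which they appear, can be sketched as below. The function name `rank_by_list_count` and the toy input are hypothetical; the sketch assumes each list has already been extracted by its wrapper and that an item repeated within one list should count only once for that list.

```python
from collections import Counter


def rank_by_list_count(extracted_lists):
    """Return (item, number_of_lists) pairs, most widely listed first.

    Ties are broken alphabetically so the output order is deterministic.
    """
    counts = Counter()
    for items in extracted_lists:
        counts.update(set(items))  # count each item at most once per list
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Toy input: three lists extracted from three hypothetical pages.
ranking = rank_by_list_count([
    ["France", "Israel", "Spain"],
    ["France", "Israel"],
    ["France", "Peru", "France"],
])
```

Items that appear in many independent lists rise to the top, which is what makes the list count usable as a rough confidence signal.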


5.4. Example and parameters

We consider a relatively simple example in Fig. 15 in order to see how the algorithm works, and to illustrate the effects of different parameters on precision, recall, overfitting, and generalization. On top we have the 4 seeds used to search and retrieve the HTML document, and below we have the 5 wrappers learned from at least 2 keywords and their bounding lines in the HTML.

The first wrapper, w1, is learned for the whole HTML document, and matches all 4 keywords; w2 is for the body, and is identical to w1, except for the context; w3 has the same wrapper pattern as w1 and w2, contains all keywords, but has a noticeably different and smaller context (just the single table block); w4 is interesting because here we see an example of overfitting. The suffix is too long and will not extract France. We see a similar problem in w5 where the prefix is too long and will not extract Israel.
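The overfitting effect described above can be reproduced on a small made-up page. Fig. 15 is not shown here, so the HTML below, the keyword subsets, and the helper names (`learn_wrapper`, `apply_wrapper`) are all hypothetical. The sketch assumes a wrapper is just a (prefix, suffix) pair taken as the longest context shared by every keyword occurrence, and it ignores the per-block context that the real wrappers carry.

```python
import re

# Made-up page: the France row has a trailing comment and the Israel row
# has a class attribute, so their contexts differ from the other rows.
PAGE = ('<table>'
        '<tr><td>Italy</td></tr>'
        '<tr><td>Japan</td></tr>'
        '<tr><td>France</td> <!-- new --></tr>'
        '<tr class="alt"><td>Israel</td></tr>'
        '</table>')


def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        n = 0
        while n < min(len(prefix), len(s)) and prefix[n] == s[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def learn_wrapper(page, keywords):
    """Learn (prefix, suffix) as the longest shared left and right context."""
    lefts, rights = [], []
    for kw in keywords:
        i = page.find(kw)  # assumes each keyword occurs exactly once
        lefts.append(page[:i][::-1])  # reversed, so suffix becomes prefix
        rights.append(page[i + len(kw):])
    return common_prefix(lefts)[::-1], common_prefix(rights)


def apply_wrapper(page, wrapper):
    """Extract tag-free text found between the wrapper's prefix and suffix."""
    prefix, suffix = wrapper
    pattern = re.escape(prefix) + r'([^<]+)(?=' + re.escape(suffix) + ')'
    return re.findall(pattern, page)


general = learn_wrapper(PAGE, ["Italy", "Japan", "France", "Israel"])
long_suffix = learn_wrapper(PAGE, ["Italy", "Israel"])
long_prefix = learn_wrapper(PAGE, ["Italy", "France"])
```

In this toy setting, the wrapper learned from all four keywords extracts every item, the one learned from Italy and Israel ends up with a suffix that is too long and skips France, and the one learned from Italy and France ends up with a prefix that is too long and skips Israel.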



  • Eugene Agichtein and L. Gravano. Snowball: Extracting Relations from Large Plain-Text Collections. In: Proceedings of the 5th ACM International Conference on Digital Libraries, pages 85–94, San Antonio, Texas, 2000.
  • Eugene Agichtein and L. Gravano. Querying Text Databases for Efficient Information Extraction. In: Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE 2003), pages 113–124, Bangalore, India, 2003.
  • Eugene Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik. Snowball: A Prototype System for Extracting Relations from Large Text Collections. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California, 2001.
  • A. Blum and Tom M. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92–100, Madison, Wisconsin, 1998.
  • Eric D. Brill. Some Advances in Rule-based Part of Speech Tagging. In: Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 722–727, Seattle, Washington, 1994.
  • S. Brin. Extracting Patterns and Relations from the World Wide Web. In: WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT’98, pages 172–183, Valencia, Spain, 1998.
  • M.E. Califf and R.J. Mooney. Relational Learning of Pattern-Match Rules for Information Extraction. In: Working Notes of AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, pages 6–11, Menlo Park, CA, 1998. AAAI Press.
  • Fabio Ciravegna. Adaptive Information Extraction from Text by Rule Induction and Generalisation. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), pages 1251–1256, Seattle, Washington, 2001.
  • Fabio Ciravegna, Alexiei Dingli, D. Guthrie, and Y. Wilks. Integrating Information to Bootstrap Information Extraction from Web Sites. In: Proceedings of the IIWeb Workshop at the 19th International Joint Conference on Artificial Intelligence (IJCAI 2003), pages 9–14, Acapulco, Mexico, 2003.
  • W. Cohen and W. Fan. Web-Collaborative Filtering: Recommending Music by Crawling the Web. Computer Networks (Amsterdam, Netherlands: 1999), 33(1–6):685–698, 2000.
  • W. Cohen, M. Hurst, and L.S. Jensen. A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. In: Proceedings of the 11th International World Wide Web Conference, pages 232–241, Honolulu, Hawaii, 2002.
  • Michael Collins and Yoram Singer. Unsupervised Models for Named Entity Classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 100–111, Maryland, USA, 1999.
  • M. Craven, D. DiPasquo, Dayne Freitag, Andrew McCallum, Tom M. Mitchell, K. Nigam, and S. Slattery. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence 118(1-2), pages 69–113, 2000.
  • S. Dill, N. Eiron, D. Gibson, D. Gruhl, Ramanathan V. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. Tomlin, and J. Zien. SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation. In: Proceedings of the 12th International Conference on World Wide Web, pages 178–186, Budapest, Hungary, 2003.
  • Pedro Domingos and Michael J. Pazzani. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning, 29:103–130, 1997.
  • R. Doorenbos, Oren Etzioni, and D. Weld. A scalable comparison-shopping agent for the World-Wide Web. In: Proceedings of the First International Conference on Autonomous Agents, pages 39–48, Marina del Rey, California, 1997.
  • D. Downey, Oren Etzioni, and S. Soderland. A Probabilistic Model of Redundancy in Information Extraction. Submitted for publication.
  • D. Downey, Oren Etzioni, S. Soderland, and D.S. Weld. Learning Text Patterns for Web Information Extraction and Assessment. In AAAI-04 Workshop on Adaptive Text Extraction and Mining, pages 50–55, 2004.
  • Oren Etzioni. Moving Up the Information Food Chain: Softbots as Information Carnivores. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1996. Revised version reprinted in AI Magazine special issue, Summer ’97.
  • Oren Etzioni, Michael J. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. Web-Scale Information Extraction in KnowItAll. In: Proceedings of the 13th International World Wide Web Conference (WWW-04), pages 100–110, New York City, New York, 2004.
  • (Freitag & McCallum, 1999) ⇒ Dayne Freitag, and Andrew McCallum. (1999). “Information Extraction with HMMs and Shrinkage.” In: Proceedings of the AAAI 1999 Workshop on Machine Learning for Information Extraction.
  • Marti Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In: Proceedings of the 14th International Conference on Computational Linguistics, pages 539–545, Nantes, France, 1992.
  • R. Jones, R. Ghani, Tom M. Mitchell, and Ellen Riloff. Active Learning for Information Extraction with Multiple View Feature Sets. In: Proceedings of the ECML/PKDD-03 Workshop on Adaptive Text Extraction and Mining, Cavtat–Dubrovnik, Croatia, 2003.
  • N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pages 729–737. San Francisco, CA: Morgan Kaufmann, 1997.
  • C. T. Kwok, Oren Etzioni, and D. Weld. Scaling Question Answering to the Web. ACM Transactions on Information Systems (TOIS), 19(3):242–262, 2001.
  • W. Lin, R. Yangarber, and Ralph Grishman. Bootstrapped Learning of Semantic Classes from Positive and Negative Examples. In: Proceedings of ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data, pages 103–111, Washington, D.C, 2003.
  • Bernardo Magnini, M. Negri, and H. Tanev. Is It the Right Answer? Exploiting Web Redundancy for Answer Validation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 425–432, 2002.
  • M. Banko, E. Brill, S. Dumais, and J. Lin. AskMSR: Question Answering Using the Worldwide Web. In: Proceedings of 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases, pages 7–9, Palo Alto, California, 2002.
  • Andrew McCallum. Efficiently Inducing Features of Conditional Random Fields. In: Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pages 403–410, Acapulco, Mexico, 2003.
  • I. Muslea, S. Minton, and C. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93–114, 2001.
  • K. Nigam and R. Ghani. Understanding the Behavior of Co-training. In: Proceedings of the KDD-2000 Workshop on Text Mining, pages 105–107, Boston, Massachusetts, 2000.
  • K. Nigam, John D. Lafferty, and Andrew McCallum. Using Maximum Entropy for Text Classification. In: Proceedings of IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, Stockholm, Sweden, 1999.
  • K. Nigam, Andrew McCallum, S. Thrun, and Tom M. Mitchell. Learning to Classify Text from Labeled and Unlabeled Documents. In: Proceedings of the 15th Conference of the American Association for Artificial Intelligence (AAAI-98), pages 792–799, Madison, Wisconsin, 1998.
  • K. Nigam, Andrew McCallum, S. Thrun, and Tom M. Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3):103–134, 2000.
  • W. Phillips and Ellen Riloff. Exploiting Strong Syntactic Heuristics and Co-Training to Learn Semantic Lexicons. In: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 125–132, Philadelphia, Pennsylvania, 2002.
  • D. Ravichandran and E. Hovy. Learning Surface Text Patterns for a Question Answering System. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 41–47, Philadelphia, Pennsylvania, 2002.
  • Ellen Riloff and R. Jones. Learning Dictionaries for Information Extraction by Multi-level Bootstrapping. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 474–479, 1999.
  • E. Rosch, C. B. Mervis, W. Gray, D. Johnson, and P. Boyes-Braem. Basic objects in natural categories. Cognitive Psychology, 3:382–439, 1976.
  • L. Schubert. Can we derive general world knowledge from texts. In: Proceedings of Human Language Technology Conference, 2002.
  • R. Snow, Daniel Jurafsky, and A.Y. Ng. Learning Syntactic Patterns for Automatic Hypernym Discovery. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.
  • S. Soderland. Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 34(1–3):233–272, 1999.
  • M. Thelen and Ellen Riloff. A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts. In: Proceedings of the 2002 Conference on Empirical Methods in NLP, pages 214–221, Philadelphia, Pennsylvania, 2002.
  • P.D. Turney. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In: Proceedings of the Twelfth European Conference on Machine Learning, pages 491–502, Freiburg, Germany, 2001.
  • P.D. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 129–159, Philadelphia, Pennsylvania, 2002.
  • P.D. Turney and M. Littman. Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems (TOIS), 21(4):315–346, 2003.
  • O. Uryupina. Semi-Supervised Learning of Geographical References within Text. In: Proceedings of the NAACL-03 Workshop on the Analysis of Geographic References, pages 21–29, Edmonton, Canada, 2003.


  • Page: 2005 UnsupNEExtrFromTheWeb
  • Author: Doug Downey, Stephen Soderland, Michael J. Cafarella, Daniel S. Weld, Alexander Yates, Oren Etzioni, Ana-Maria Popescu, Tal Shaked
  • Title: Unsupervised Named-Entity Extraction from the Web: An Experimental Study
  • Journal: Intelligent (AI) Machine
  • URL: http://turing.cs.washington.edu/papers/KnowItAll AIJ.pdf
  • Year: 2005