2002 ABootstrappingMethForLearnSemLexUsExtrPatCtxts

From GM-RKB

Subject Headings: Bootstrapping, Relation Recognition, Semi-Supervised Named Entity Recognition Algorithm, Basilisk Algorithm.

Notes

Quotes

Abstract

This paper describes a bootstrapping algorithm called Basilisk that learns high-quality semantic lexicons for multiple categories. Basilisk begins with an unannotated corpus and seed words for each semantic category, which are then bootstrapped to learn new words for each category. Basilisk hypothesizes the semantic class of a word based on collective information over a large body of extraction pattern contexts. We evaluate Basilisk on six semantic categories. The semantic lexicons produced by Basilisk have higher precision than those produced by previous techniques, with several categories showing substantial improvement.

1 Introduction

In recent years, several algorithms have been developed to acquire semantic lexicons automatically or semi-automatically using corpus-based techniques. For our purposes, the term semantic lexicon will refer to a dictionary of words labeled with semantic classes (e.g., "bird" is an animal and "truck" is a vehicle). Semantic class information has proven to be useful for many natural language processing tasks, including information extraction (Riloff and Schmelzenbach, 1998; Soderland et al., 1995), anaphora resolution (Aone and Bennett, 1996), question answering (Moldovan et al., 1999; Hirschman et al., 1999), and prepositional phrase attachment (Brill and Resnik, 1994). Although some semantic dictionaries do exist (e.g., WordNet (Miller, 1990)), these resources often do not contain the specialized vocabulary and jargon that is needed for specific domains. Even for relatively general texts, such as the Wall Street Journal (Marcus et al., 1993) or terrorism articles (MUC-4 Proceedings, 1992), Roark and Charniak (Roark and Charniak, 1998) reported that 3 of every 5 terms generated by their semantic lexicon learner were not present in WordNet. These results suggest that automatic semantic lexicon acquisition could be used to enhance existing resources such as WordNet, or to produce semantic lexicons for specialized domains.

We have developed a weakly supervised bootstrapping algorithm called Basilisk that automatically generates semantic lexicons. Basilisk hypothesizes the semantic class of a word by gathering collective evidence about semantic associations from extraction pattern contexts. Basilisk also learns multiple semantic classes simultaneously, which helps constrain the bootstrapping process.

First, we present Basilisk's bootstrapping algorithm and explain how it di ers from previous work on semantic lexicon induction. Second, we present empirical results showing that Basilisk outperforms a previous algorithm. Third, we explore the idea of learning multiple semantic categories simultaneously by adding this capability to Basilisk as well as another bootstrapping algorithm. Finally, we present results showing that learning multiple semantic categories simultaneously improves performance.

2 Bootstrapping using Collective Evidence from Extraction Patterns

Basilisk (Bootstrapping Approach to Semantic Lexicon Induction using Semantic Knowledge) is a weakly supervised bootstrapping algorithm that automatically generates semantic lexicons. Figure 1 shows the high-level view of Basilisk's bootstrapping process. The input to Basilisk is an unannotated text corpus and a few manually defined seed words for each semantic category.
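The bootstrapping process described above can be illustrated with a small Python sketch. The data layout (a mapping from each extraction pattern to the noun phrases it extracts) and the helper names are illustrative assumptions, and the scoring is a simplified AvgLog-style measure, not the paper's exact procedure:

```python
import math

def avg_log(word, patterns, lexicon):
    # Collective evidence: average, over every pattern that extracts `word`,
    # of log2(1 + number of known category members that pattern extracts).
    hits = [len(nps & lexicon) for nps in patterns.values() if word in nps]
    if not hits:
        return 0.0
    return sum(math.log2(1 + h) for h in hits) / len(hits)

def bootstrap(patterns, seeds, iterations=5, words_per_iter=5):
    # patterns: {pattern string: set of noun phrases it extracts from the corpus}
    lexicon = set(seeds)
    for _ in range(iterations):
        candidates = set().union(*patterns.values()) - lexicon
        if not candidates:
            break
        ranked = sorted(candidates,
                        key=lambda w: avg_log(w, patterns, lexicon),
                        reverse=True)
        lexicon.update(ranked[:words_per_iter])  # admit the best-scoring words
    return lexicon
```

In the full algorithm each iteration also re-scores the extraction patterns and draws candidates only from a growing pool of the best patterns; this sketch skips that pattern-pool step for brevity.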

2.1.3 Related Work

Several weakly supervised learning algorithms have previously been developed to generate semantic lexicons from text corpora. Riloff and Shepherd (Riloff and Shepherd, 1997) developed a bootstrapping algorithm that exploits lexical co-occurrence statistics, and Roark and Charniak (Roark and Charniak, 1998) refined this algorithm to focus more explicitly on certain syntactic structures. Hale, Ge, and Charniak (Ge et al., 1998) devised a technique to learn the gender of words. Caraballo (Caraballo, 1999) and Hearst (Hearst, 1992) created techniques to learn hypernym/hyponym relationships. None of these previous algorithms used extraction patterns or similar contexts to infer semantic class associations.

Several learning algorithms have also been developed for named entity recognition (e.g., (Collins and Singer, 1999; Cucerzan and Yarowsky, 1999)). (Collins and Singer, 1999) used contextual information of a different sort than we do. Furthermore, our research aims to learn general nouns (e.g., "artist") rather than proper nouns, so many of the features commonly used to great advantage for named entity recognition (e.g., capitalization and title words) are not applicable to our task.

The algorithm most closely related to Basilisk is meta-bootstrapping (Riloff and Jones, 1999), which also uses extraction pattern contexts for semantic lexicon induction. Meta-bootstrapping identifies a single extraction pattern that is highly correlated with a semantic category and then assumes that all of its extracted noun phrases belong to the same category. However, this assumption is often violated, which allows incorrect terms to enter the lexicon. Riloff and Jones acknowledged this issue and used a second level of bootstrapping (the "Meta" bootstrapping level) to alleviate this problem. While meta-bootstrapping trusts individual extraction patterns to make unilateral decisions, Basilisk gathers collective evidence from a large set of extraction patterns. As we will demonstrate in Section 2.2, Basilisk's approach produces better results than meta-bootstrapping and is also considerably more efficient because it uses only a single bootstrapping loop (meta-bootstrapping uses nested bootstrapping). However, meta-bootstrapping produces category-specific extraction patterns in addition to a semantic lexicon, while Basilisk focuses exclusively on semantic lexicon induction.
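The contrast between the two strategies can be seen on a toy example. The data and the `avg_log` helper below are illustrative stand-ins (not the paper's corpus or exact formulas): the single-best-pattern rule admits every extraction of its chosen pattern, while collective scoring penalizes a word that also appears in non-category contexts.

```python
import math

# Toy corpus statistics (illustrative, not from the paper): each extraction
# pattern maps to the set of noun phrases it extracts.
patterns = {
    "attacked by <x>":    {"terrorists", "rebels", "guerrillas", "dogs"},
    "<x> planted a bomb": {"terrorists", "guerrillas"},
    "fed the <x>":        {"dogs", "cats"},
}
lexicon = {"terrorists", "rebels"}  # known category members

# Meta-bootstrapping-style rule: pick the single best pattern and trust all
# of its extractions -- here the incorrect "dogs" would enter the lexicon.
best = max(patterns, key=lambda p: len(patterns[p] & lexicon))
single_pattern_adds = patterns[best] - lexicon

# Collective-evidence rule: average log-evidence over every pattern that
# extracts the word; "dogs" is dragged down by its non-category context.
def avg_log(word):
    hits = [len(nps & lexicon) for nps in patterns.values() if word in nps]
    return sum(math.log2(1 + h) for h in hits) / len(hits)
```

Here `best` is "attacked by <x>", so `single_pattern_adds` contains both "guerrillas" and the erroneous "dogs", whereas `avg_log` ranks "guerrillas" well above "dogs".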

References

  • 1. Chinatsu Aone, Scott Bennett, Applying machine learning to anaphora resolution, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, p.302-314, January 1996
  • 2. Eric Brill, Philip Resnik, A rule-based approach to prepositional phrase attachment disambiguation, Proceedings of the 15th conference on Computational linguistics, August 05-09, 1994, Kyoto, Japan doi:10.3115/991250.991346
  • 3. Sharon A. Caraballo, Automatic construction of a hypernym-labeled noun hierarchy from text, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, p.120-126, June 20-26, 1999, College Park, Maryland doi:10.3115/1034678.1034705
  • 4. M. Collins and Yoram Singer. (1999). Unsupervised Models for Named Entity Classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).
  • 5. S. Cucerzan and D. Yarowsky. (1999). Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).
  • 6. W. Gale, Kenneth W. Church, and David Yarowsky. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415--439.
  • 7. Niyu Ge, John Hale, and Eugene Charniak. (1998). A statistical approach to anaphora resolution. In: Proceedings of the Sixth Workshop on Very Large Corpora.
  • 8. Marti A. Hearst, Automatic acquisition of hyponyms from large text corpora, Proceedings of the 14th conference on Computational linguistics, August 23-28, 1992, Nantes, France doi:10.3115/992133.992154
  • 9. Lynette Hirschman, Marc Light, Eric Breck, John D. Burger, Deep Read: a reading comprehension system, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, p.325-332, June 20-26, 1999, College Park, Maryland doi:10.3115/1034678.1034731
  • 10. Mitchell P. Marcus, Mary Ann Marcinkiewicz, Beatrice Santorini, Building a large annotated corpus of English: the penn treebank, Computational Linguistics, v.19 n.2, June 1993
  • 11. George Miller. (1990). Wordnet: An on-line lexical database. In International Journal of Lexicography.
  • 12. Dan Moldovan, Sanda Harabagiu, Marius Paşca, Rada Mihalcea, Richard Goodrum, Roxana Gîrju, and Vasile Rus. (1999). LASSO: A tool for surfing the answer net. In: Proceedings of the Eighth Text REtrieval Conference (TREC-8).
  • 13. MUC-4 Proceedings. (1992). Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA.
  • 14. Ellen Riloff, Rosie Jones, Learning dictionaries for information extraction by multi-level bootstrapping, Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence, p.474-479, July 18-22, 1999, Orlando, Florida, United States
  • 15. E. Riloff, and M. Schmelzenbach. (1998). An Empirical Approach to Conceptual Case Frame Acquisition. In: Proceedings of the Sixth Workshop on Very Large Corpora, pages 49--56.
  • 16. E. Riloff and J. Shepherd. (1997). A Corpus-based Approach for Building Semantic Lexicons. In: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117--124.
  • 17. E. Riloff. (1996). Automatically Generating Extraction Patterns from Untagged Text. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044--1049. The AAAI Press/MIT Press.
  • 18. Brian Roark, Eugene Charniak, Noun-phrase co-occurrence statistics for semiautomatic semantic lexicon construction, Proceedings of the 17th International Conference on Computational linguistics, p.1110-1116, August 10-14, 1998, Montreal, Quebec, Canada
  • 19. Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy Lehnert. (1995). CRYSTAL: Inducing a conceptual dictionary. In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314--1319.


Citation: Ellen Riloff, Michael Thelen. (2002). "A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pattern Contexts." In: Proceedings of the ACL 2002 Conference on Empirical Methods in Natural Language Processing. http://acl.ldc.upenn.edu/W/W02/W02-1028.pdf doi:10.3115/1118693.1118721