2001 ScenearioCustomizationForIE

From GM-RKB

Subject Headings: Information Extraction Algorithm, Proteus

Notes

Cited By

~32 http://scholar.google.com/scholar?cites=5713356703471600379

Quotes

Abstract

  • Information Extraction (IE) is an emerging NLP technology, whose function is to process unstructured, natural language text, to locate specific pieces of information, or facts, in the text, and to use these facts to fill a database.
  • Construction of a pattern base for a new topic is recognized as a time-consuming and expensive process — a principal roadblock to wider use of IE technology in the large. An effective pattern base must be precise and have wide coverage.
  • Each record in the table can have a link back to the originating document. The exact place in the document where the event is reported may be highlighted for convenient access.
  • In the context of IE, “meaning” is understood in terms of facts, formally described as a fixed set of semantic objects — entities, relationships among entities, and events in which entities participate. The semantic objects belong to a small number of types, all having fixed regular structure, within a fixed and closely circumscribed subject domain. This regularity of structure allows the objects to be stored in a relational database.
  • the topic of MUC-6 (the Sixth Message Understanding Conference); in this scenario the system seeks events in which high-level corporate executives left their posts or assumed new ones.
  • IE systems today are commonly based on pattern matching and partial syntactic analysis. Patterns consist of regular expressions (RE) and their associated mappings from syntactic to logical form. The patterns are stored in a pattern base, one of the knowledge bases (KBs) used by the IE system.
  • Performance: Zipf’s Law, a.k.a. the “long tail” syndrome, is a serious problem in IE, as in other areas of NLP. A large number of facts are covered by a small number of frequently occurring patterns, while the remaining facts — the tail of the distribution — are covered by many more rare patterns.
  • Named Entity (NE): find and categorize (certain classes of) proper names appearing in text.
  • Template Relation (TR): find instances of broader relations among entities, such as the “employment” relation between persons and companies, or the “parent/subsidiary relation” between companies.
  • [51] shows how bootstrapping can effectively train a “concept spotter”, a classifier for proper names or noun phrases, by learning patterns stated in terms of short-range lexical items.
  • [59] uses bootstrapping for word-sense disambiguation; though it’s a somewhat different problem from ours, the ideas have important similarities — which attest to the generality and strength of the overall principles that unify all this research.
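
The quotes above describe a pattern base of regular expressions whose matches are mapped to structured records, each linked back to its place in the source text. The following is a minimal sketch of that idea for the MUC-6 management succession scenario. It is not the Proteus implementation: the names `SuccessionEvent`, `PATTERN_BASE`, and `extract`, the simplified sub-patterns for names, organizations, and posts, and the two example patterns are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class SuccessionEvent:
    """One database record for a management succession fact (hypothetical schema)."""
    person: str
    position: str
    company: str
    action: str   # "leave" or "assume"
    span: tuple   # (start, end) character offsets in the source text


# Simplified, assumed sub-patterns; a real system would rely on a
# named-entity tagger and partial syntactic analysis instead.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"
ORG = r"[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*"
POST = r"(?:president|chairman|chief executive officer|CEO)"

# The pattern base: each regular expression is paired with the event
# type its match is mapped to.
PATTERN_BASE = [
    (re.compile(rf"(?P<person>{NAME}) (?:resigned|retired|stepped down) as "
                rf"(?P<position>{POST}) of (?P<company>{ORG})"), "leave"),
    (re.compile(rf"(?P<company>{ORG}) (?:named|appointed) "
                rf"(?P<person>{NAME}) (?:as )?(?P<position>{POST})"), "assume"),
]


def extract(text):
    """Apply every pattern in the pattern base and return the matched events."""
    events = []
    for regex, action in PATTERN_BASE:
        for m in regex.finditer(text):
            events.append(SuccessionEvent(
                person=m["person"],
                position=m["position"],
                company=m["company"],
                action=action,
                span=m.span(),  # link back to the exact place in the document
            ))
    return events


if __name__ == "__main__":
    doc = ("John Smith resigned as president of Acme Corp. "
           "Acme Corp named Mary Jones as chairman.")
    for event in extract(doc):
        print(event, "<-", doc[event.span[0]:event.span[1]])
```

Two hand-written patterns cover only the most frequent phrasings; a sentence such as "Smith will be succeeded by Jones" is missed entirely. That gap is the long-tail problem noted in the quotes, and it is why building a pattern base by hand is costly.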

References

  • Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-text collections. In: Proceedings of the 5th ACM International Conference on Digital Libraries (DL'00), (2000). To appear.
  • Douglas E. Appelt, J. Hobbs, J. Bear, D. Israel, M. Kameyama, and M. Tyson. SRI: Description of the JV-FASTUS System used for MUC-5. In: Proceedings of Fifth Message Understanding Conference (MUC-5), Baltimore, MD, August (1993). Morgan Kaufmann.
  • Douglas E. Appelt, Jerry Hobbs, John Bear, David Israel, and Mabry Tyson. FASTUS: A finite-state processor for information extraction from real-world text. In: Proceedings of 13th Int'l Joint Conference Artificial Intelligence (IJCAI-93), pages 1172-1178, August 1993.
  • Amit Bagga and Alan Biermann. Analyzing the performance of message understanding systems. Technical Report CS-1997-01, Dept. of Computer Science, Duke University, 1997.
  • Avrim Blum and Tom M. Mitchell. Combining Labeled and Unlabeled Data with Co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98), pages 92-100, New York, July 24-26 (1998). ACM Press.
  • Andrew Borthwick, John Sterling, Eugene Agichtein, and Ralph Grishman. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In: Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, August 1998.
  • Sergey Brin. Extracting patterns and relations from the world wide web. In WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT'98, 1998.
  • M. E. Califf and Raymond Mooney. Relational learning of pattern-match rules for information extraction. In Working Notes of AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, pages 6-11, Menlo Park, CA, (1998). AAAI Press.
  • Mary Elaine Califf. Relational Learning Techniques for Natural Language Information Extraction. PhD thesis, Department of Computer Sciences, University of Texas, Austin, TX, August (1998). Also appears as Artificial Intelligence Laboratory Technical Report AI 98-276 (see http://www.cs.utexas.edu/users/ai-lab).
  • Claire Cardie and David Pierce. Proposal for an interactive environment for information extraction. Technical Report TR98-1702, Cornell University, Computer Science, September 2, 1998.
  • Michael Collins and Yoram Singer. Unsupervised models for named entity classification. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, June (1999). University of Maryland.
  • Ido Dagan, Shaul Marcus, and Shaul Markovitch. Contextual word similarity and estimation from sparse data. In: Proceedings of the 31st Annual Meeting of the Assn. for Computational Linguistics, pages 31-37, Columbus, OH, June 1993.
  • David Fisher, Stephen Soderland, Joseph McCarthy, Fangfang Feng, and Wendy Lehnert. Description of the UMass system as used for MUC-6. In: Proceedings. Sixth Message Understanding Conference (MUC-6), Columbia, MD, November 1995. Morgan Kaufmann.
  • Dayne Freitag and Andrew McCallum. Information extraction with HMMs and shrinkage. In: Proceedings of Workshop on Machine Learning and Information Extraction (AAAI-99), Orlando, FL, July 1999.
  • The second Garnet compendium: Collected papers, 1990-1992. Technical Report CMU-CS-93-108, Carnegie Mellon University, Computer Science, February 1993.
  • Michael Gregory. Private communication, 2000.
  • Ralph Grishman. The NYU system for MUC-6, or where's the syntax? In: Proceedings. Sixth Message Understanding Conference (MUC-6), pages 167-176, Columbia, MD, November (1995). Morgan Kaufmann.
  • Ralph Grishman. Tipster Phase II Architecture Design Document, Version 1.52. New York University, August 1995.
  • Ralph Grishman. Information extraction: Techniques and challenges. In Maria Teresa Pazienza, editor, Information Extraction. Springer-Verlag, Lecture Notes in Artificial Intelligence, Rome, 1997.
  • Ralph Grishman, Catherine Macleod, and Adam Meyers. Comlex Syntax: Building a computational lexicon. In: Proceedings of 15th Int'l Conference Computational Linguistics (COLING 94), pages 268-272, Kyoto, Japan, August 1994.
  • Ralph Grishman, Catherine Macleod, and John Sterling. New York University: Description of the PROTEUS System as used for MUC-4. In: Proceedings of Fourth Message Understanding Conference (MUC-4), pages 233-241, McLean, VA, June 1992.
  • Ralph Grishman and John Sterling. New York University: Description of the PROTEUS System as used for MUC-5. In: Proceedings of Fifth Message Understanding Conference (MUC-5), Baltimore, MD, August (1993). Morgan Kaufmann.
  • Ralph Grishman and Roman Yangarber. Issues in corpus-trained information extraction. In: Proceedings of International Symposium: Toward the Realization of Spontaneous Speech Engineering, pages 107-112, Tokyo, Japan, February 2000.
  • Zellig S. Harris. Linguistic transformations for information retrieval. In: Proceedings of International Conference on Scientific Information, 1957.
  • Lynette Hirschman, Ralph Grishman, and Naomi Sager. Grammatically-based automatic word class formation. Information Processing and Management, 11(1/2):39-57, 1975.
  • Silja Huttunen. Private communication, 1999.
  • Silja Huttunen. Private communication, 2000.
  • Timo Järvinen and Pasi Tapanainen. A dependency parser for English. Technical Report TR-1, Department of General Linguistics, University of Helsinki, Finland, February 1997.
  • Martin Kay and Martin Röscheisen. Text-translation alignment. Computational Linguistics, 19(1), 1993.
  • W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, E. Riloff, and S. Soderland. University of Massachusetts: MUC-4 test results and analysis. In: Proceedings of Fourth Message Understanding Conf., McLean, VA, June (1992). Morgan Kaufmann.
  • Catherine Macleod, Ralph Grishman, and Adam Meyers. Creating a common syntactic dictionary of English. In: Proceedings of Int'l Workshop on Shared Natural Language Resources, Nara, Japan, August 1994.
  • Adam Meyers, Catherine Macleod, Roman Yangarber, Ralph Grishman, Leslie Barrett, and Ruth Reeves. Using NOMLEX to produce nominalization patterns for information extraction. In: Proceedings of the COLING-ACL '98 Workshop on Computational Treatment of Nominals, Montreal, Canada, August 1998.
  • George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, November 1995.
  • Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz, Rebecca Stone, Ralph Weischedel, and the Annotation Group. Algorithms that learn to extract information; BBN: Description of the SIFT system as used for MUC-7. In: Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, 1998.
  • Tom M. Mitchell. The role of unlabeled data in supervised learning. In: Proceedings of the Sixth International Colloquium on Cognitive Science, San Sebastian, Spain, 1999.
  • Proceedings of the Third Message Understanding Conference (MUC-3). Morgan Kaufmann, May 1991.
  • Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, June 1992.
  • Proceedings of the Fifth Message Understanding Conference (MUC-5), Baltimore, MD, August (1993). Morgan Kaufmann.
  • Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD, November (1995). Morgan Kaufmann.
  • Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, (1998). http://www.muc.saic.com/.
  • Johanna Nichols. Secondary predicates. Proceedings of the 4th Annual Meeting of Berkeley Linguistics Society, pages 114-127, 1978.
  • Maria Teresa Pazienza, editor. Information Extraction. Springer-Verlag, Lecture Notes in Artificial Intelligence, Rome, 1997.
  • Fernando Pereira, Naftali Tishby, and Lillian Lee. Distributional clustering of English words. In: Proceedings of the 31st Annual Meeting of the Assn. for Computational Linguistics, pages 183-190, Columbus, OH, June 1993.
  • Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. In: Proceedings of Eleventh National Conference on Artificial Intelligence (AAAI-93), pages 811-816. The AAAI Press/MIT Press, 1993.
  • Ellen Riloff. Automatically generating extraction patterns from untagged text. In: Proceedings of Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1044-1049. The AAAI Press/MIT Press, 1996.
  • Ellen Riloff and Rosie Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In: Proceedings of Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, Florida, 1999.
  • Naomi Sager, Carol Friedman, and Margaret Lyman. Medical Language Processing: Computer Management of Narrative Data. Addison Wesley, 1987.
  • Yutaka Sasaki. Applying type-oriented ILP to IE rule generation. In: Proceedings of Workshop on Machine Learning and Information Extraction (AAAI 99), Orlando, FL, July 1999.
  • Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 44(1-3):233-272, 1999.
  • Tomek Strzalkowski and Jose Perez Carballo. Natural language information retrieval: TREC-4 report. In: Proceedings of Fourth Text Retrieval Conference, Gaithersburg, MD, November 1995.
  • Tomek Strzalkowski and Jin Wang. A self-learning universal concept spotter. In: Proceedings of 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, August 1996.
  • Pasi Tapanainen and Timo Järvinen. A non-projective dependency parser. In: Proceedings of the 5th Conference on Applied Natural Language Processing, pages 64-71, Washington, D.C., April (1997). ACL.
  • Cynthia A. Thompson, Mary Elaine Califf, and Raymond Mooney. Active learning for natural language parsing and information extraction. In: Proceedings of 16th International Conference on Machine Learning, pages 406-414. Morgan Kaufmann, San Francisco, CA, 1999.
  • Roman Yangarber. PET: The Proteus Extraction Toolkit. User and Developer Manual, 1997.
  • Roman Yangarber and Ralph Grishman. Customization of information extraction systems. In Paola Velardi, editor, International Workshop on Lexically Driven Information Extraction, pages 1-11, Frascati, Italy, July 1997. Università di Roma.
  • Roman Yangarber and Ralph Grishman. NYU: Description of the Proteus/PET system as used for MUC-7 ST. In MUC-7: Seventh Message Understanding Conference, Columbia, MD, April 1998.
  • Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Automatic acquisition of domain knowledge for information extraction. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), Saarbrücken, Germany, August 2000.
  • Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Unsupervised discovery of scenario-level patterns for information extraction. In: Proceedings of Conference on Applied Natural Language Processing (ANLP-NAACL'00), Seattle, WA, April 2000.
  • David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, Cambridge, MA, July 24-26, 1995. ACM Press.

  • Page: 2001 ScenearioCustomizationForIE
  • Author: Roman Yangarber
  • Title: Scenario Customization for Information Extraction
  • Type: Doctoral Dissertation
  • Title URL: http://nlp.cs.nyu.edu/publication/papers/yangarber thesis.ps.gz
  • DOI: 10.1.1.75.9017
  • Year: 2001