Keywords: Snowball, Information Extraction, Bootstrapping
Quotes
Abstract
- "Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, that in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
2.1 Generating Patterns
- Definition 1. A Snowball pattern is a 5-tuple <left, tag1, middle, tag2, right>, where tag1 and tag2 are namedentity tags, and left, middle, and right are vectors associating weights with terms.
- Definition 2. The degree of match Match(tP , tS) between two 5-tuples tP =< lP , t1, mP , t2, rP > (with tags t1 and t2) and tS =< lS, t'1, mS, t'2, rS > (with tags t'1 and t'2) is defined as: Match(tP , tS) = lP · lS + mP · mS + rP · rS (if the tags match); otherwise zero (0).
2.3 Evaluating Patterns and Tuples
- Definition 3. The confidence of a pattern P is: Conf (P) = P.positive / (P.positive + P.negative); where P.positive is the number of positive matches for P and P.negative is the number of negative matches.
- "Our definition of confidence of a pattern above is only one among many possibilities. An alternative is to account for a pattern’s coverage in addition to its selectivity. For this, we adopt a metric originally proposed by Riloff [11] to evaluate extraction patterns generated by the Autoslog-TS information extraction system, and define Conf RlogF (P) of pattern P as follows:
- Definition 4. The RlogF confidence of pattern p is: ConfRlogF(p) = p.positive (p.positive + p.negative)·log2(p.positive)
- Definition 5. The confidence of a candidate tuple T is: Conf(T) = 1 − MULT(i =0,P) (1 − {Conf(Pi) · Match(Ci, Pi)}) where P = {Pi} is the set of patterns that generated T and Ci is the context associated with an occurrence of T that matched Pi.
Conclusion
- "We only evaluated our techniques on plain text documents, and it would require future work to adopt our methodology to HTML data. While HTML tags can be naturally incorporated into Snowball’s pattern representation, it is problematic to extract named-entity tags from arbitrary HTML documents. State-of-the-art taggers rely on clues from the text surrounding each entity, which may be absent in HTML documents that often rely on visual formatting to convey information.
- "On a related note, we have assumed throughout that the attributes of the relation we extract (i.e., organization and location) correspond to named entities that our tagger can identify accurately. As we mentioned, named-entity taggers like Alembic can be extended to recognize entities that are distinct in a context-independent way (e.g., numbers, dates, proper names). For some other attributes, we will need to extend Snowball so that its pattern generation and matching could be anchored around, say, a noun phrase as opposed to a named entity as in this paper.
- "In the future, we will also generalize Snowball to relations of more than two attributes.
- "Finally, a crucial open problem is how to generalize our tuple and pattern evaluation strategy of Section 2.3 so that it does not rely on an attribute being a key for the relation.
References
1. Proceedings of the Sixth Message Understanding Conference. Morgan Kaufman, 1995.
2. Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory, 1998.
3. Sergey Brin. Extracting patterns and relations from the World-Wide Web. In Proceedings of the 1998 International Workshop on the Web and Databases (WebDB’98), March 1998.
4. William Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the 1998 ACMInternational Conference on Management of Data (SIGMOD’98), 1998.
5. Michael Collins and Yoram Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
6. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 1999.
7. David Day, John Aberdeen, Lynette Hirschman, Robyn Kozierok, Patricia Robinson, and Marc Vilain. Mixedinitiative development of language processing systems. In Proceedings of the Fifth ACL Conference on Applied Natural Language Processing, April 1997.
8. D. Fisher, S. Soderland, J. McCarthy, F. Feng, andW. Lehnert. Description of the UMass systems as used for MUC-6. In Proceedings of the 6th Message Understanding Conference. Columbia, MD, 1995.
9. William B. Frakes and Ricardo Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
10. Ralph Grishman. Information extraction: Techniques and challenges. In Information Extraction (International Summer School SCIE-97). Springer-Verlag, 1997.
11. Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044–1049, 1996.
12. Ellen Riloff and Rosie Jones. Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
13. Gerard Salton. Automatic Text Processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley, 1989.
14. Roman Yangarber and Ralph Grishman. NYU: Description of the Proteus/PET system as used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7). Morgan Kaufman, 1998.
15. D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196. Cambridge, MA, 1995.
16. Jeonghee Yi and Neel Sundaresan. Mining the web for acronyms using the duality of patterns and relations. In Proceedings of the 1999 Workshop on Web Information and Data
Management, 1999.
BibTeX
@inproceedings{DBLP:conf/dl/AgichteinG00,
author = {Eugene Agichtein and
Luis Gravano},
title = {{\it nowball}: extracting relations from large plain-text
collections.},
booktitle = {ACM DL},
year = {2000},
pages = {85-94},
ee = {http://doi.acm.org/10.1145/336597.336644},
bibsource = {DBLP, http://dblp.uni-trier.de}}