2008 EntityCatOverLargeDocCollections

(Ganti et al., 2008) ⇒ Venkatesh Ganti, Arnd Christian Konig, and Rares Vernica. (2008). “Entity Categorization Over Large Document Collections.” In: Proceedings of KDD 2008 (KDD-2008).

Subject Headings: Supervised Named Entity Recognition Algorithm, Supervised Relation Recognition Algorithm.

Notes

Cited By

Quotes

Abstract

Extracting entities (such as people, movies) from documents and identifying the categories (such as painter, writer) they belong to enable structured querying and data analysis over unstructured document collections. In this paper, we focus on the problem of categorizing extracted entities. Most prior approaches developed for this task only analyzed the local document context within which entities occur. In this paper, we significantly improve the accuracy of entity categorization by (i) considering an entity’s context across multiple documents containing it, and (ii) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of entities has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address these challenges. We present a thorough experimental study on real data sets that demonstrates the increase in accuracy and the scalability of our approaches.

1. Introduction

One particular area of recent interest has been the automatic extraction of unary relations (such as is-a-painter, is-a-researcher, or is-a-camera) and binary relations (such as is-a-painter-of, is-author-of) between named entities (e.g., [1, 6, 15, 23]). Here, we differentiate between two approaches: “open” relation extraction [6] where arbitrary relations are extracted and targeted relation extraction where only a small number of known target relations (e.g., actors, painters, electronic products) are extracted.

In this paper, we focus on the extraction of targeted relations. We view the targeted relation extraction as that of categorizing named entities, into a set of target classes such as painters, researchers, etc. Henceforth, we use the terms unary relation extraction and entity categorization interchangeably.

For example, we can use the combination of features such as '[Entity] presents results' and '[Entity] publishes', each of which is not sufficiently predictive by itself to allow extraction of the tuple (Entity,is-a-researcher) (after all, companies present results and newspapers publish), but which – when combined – make it very likely that the entity in question is a researcher.
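The idea in the passage above can be illustrated with a minimal Python sketch. It is not the authors' classifier, which this excerpt does not describe. It assumes a toy corpus, hypothetical feature weights, and a simple linear threshold as a stand-in: context patterns are first aggregated per entity across documents, and an entity is labeled is-a-researcher only when the combined evidence crosses the threshold. All names (`OBSERVATIONS`, `WEIGHTS`, `THRESHOLD`, `aggregate_contexts`, `is_researcher`, `categorize`) are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (document id, entity, context pattern seen with the entity).
OBSERVATIONS = [
    ("doc1", "Alice Smith", "[Entity] presents results"),
    ("doc2", "Alice Smith", "[Entity] publishes"),
    ("doc3", "Acme Corp", "[Entity] presents results"),
    ("doc4", "Daily News", "[Entity] publishes"),
]

# Assumed weights: each pattern alone stays below the threshold,
# but the two together exceed it.
WEIGHTS = {
    "[Entity] presents results": 0.6,
    "[Entity] publishes": 0.6,
}
THRESHOLD = 1.0


def aggregate_contexts(observations):
    """Collect pattern counts per entity across all documents."""
    features = defaultdict(Counter)
    for _doc_id, entity, pattern in observations:
        features[entity][pattern] += 1
    return features


def is_researcher(pattern_counts):
    """Sum the weights of the distinct patterns observed (presence, not frequency)."""
    score = sum(WEIGHTS.get(pattern, 0.0) for pattern in pattern_counts)
    return score >= THRESHOLD


def categorize(observations):
    """Label every entity using its aggregated, cross-document context."""
    aggregated = aggregate_contexts(observations)
    return {entity: is_researcher(counts) for entity, counts in aggregated.items()}


if __name__ == "__main__":
    for entity, label in categorize(OBSERVATIONS).items():
        print(entity, "is-a-researcher" if label else "unknown")
```

In this toy setup, only the entity whose two patterns appear in different documents is labeled a researcher, which is why the context has to be aggregated across documents before any decision is made.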

…

References

1. E. Agichtein. Scaling Information Extraction to Large Document Collections. IEEE Data Eng. Bull., 28(4):3--10, 2005.
2. E. Agichtein and L. Gravano. Querying Text Databases for Efficient Information Extraction. In ICDE, 2003.
3. E. Agichtein and S. Sarawagi. Scalable Information Extraction and Integration. In ACM SIGKDD, 2006.
4. Douglas E. Appelt, and D. Israel. Introduction to Information Extraction Technology. IJCAI-99 Tutorial, 1999.
5. Nikolay Archak, Anindya Ghose, Panagiotis G. Ipeirotis, Show me the money!: deriving the pricing power of product features by mining consumer reviews, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 12-15, 2007, San Jose, California, USA doi:10.1145/1281192.1281202
6. Michele Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and Oren Etzioni. Open Information Extraction from the Web. In IJCAI, pages 2670--2676, 2007.
7. Burton H. Bloom, Space/time trade-offs in hash coding with allowable errors, Communications of the ACM, v.13 n.7, p.422-426, July 1970 doi:10.1145/362686.362692
8. Michael J. Cafarella, Michele Banko, and Oren Etzioni. Relational Web Search. In WWW Conference, 2006.
9. Michael J. Cafarella, Oren Etzioni, A search engine for natural language applications, Proceedings of the 14th International Conference on World Wide Web, May 10-14, 2005, Chiba, Japan doi:10.1145/1060745.1060811
10. Amit Chandel, P. C. Nagesh, Sunita Sarawagi, Efficient Batch Top-k Search for Dictionary-based Entity Recognition, Proceedings of the 22nd International Conference on Data Engineering, p.28, April 03-07, 2006 doi:10.1109/ICDE.2006.55
11. W. Cohen and A. McCallum. Information Extraction and Integration: an Overview. In SIGKDD, 2004.
12. Graham Cormode, S. Muthukrishnan, An improved data stream summary: the count-min sketch and its applications, Journal of Algorithms, v.55 n.1, p.58-75, April 2005 doi:10.1016/j.jalgor.2003.12.001
13. Graham Cormode, S. Muthukrishnan, What's hot and what's not: tracking most frequent items dynamically, ACM Transactions on Database Systems (TODS), v.30 n.1, p.249-278, March 2005 doi:10.1145/1061318.1061325
14. D. Downey, Oren Etzioni, and S. Soderland. A Probabilistic Model of Redundancy in Information Extraction. In IJCAI, 2005.
15. Ronen Feldman, B. Rosenfeld, S. Soderland, and Oren Etzioni. Self-supervised Relation Extraction from the Web. In ISMIS, 2006.
16. Goetz Graefe, Query evaluation techniques for large databases, ACM Computing Surveys (CSUR), v.25 n.2, p.73-169, June 1993 doi:10.1145/152610.152611
17. Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, Luis Gravano, To search or to crawl?: towards a query optimizer for text-centric tasks, Proceedings of the 2006 ACM SIGMOD Conference, June 27-29, 2006, Chicago, IL, USA doi:10.1145/1142473.1142504
18. Arnd Christian König, Eric D. Brill, Reducing the human overhead in text categorization, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2006, Philadelphia, PA, USA doi:10.1145/1150402.1150474
19. Andrew McCallum, Wei Li, Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, p.188-191, May 31, 2003, Edmonton, Canada doi:10.3115/1119176.1119206
20. Qiaozhu Mei, ChengXiang Zhai, A mixture model for contextual text mining, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2006, Philadelphia, PA, USA doi:10.1145/1150402.1150482
21. Gonzalo Navarro, Mathieu Raffinot, Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences, Cambridge University Press, New York, NY, 2002
22. John C. Platt, Fast training of support vector machines using sequential minimal optimization, Advances in kernel methods: support vector learning, MIT Press, Cambridge, MA, 1999.
23. Benjamin Rosenfeld, Ronen Feldman, Moshe Fresko, Jonathan Schler, Yonatan Aumann, TEG: a hybrid approach to Information Extraction, Proceedings of the thirteenth ACM International Conference on Information and knowledge management, November 08-13, 2004, Washington, D.C., USA doi:10.1145/1031171.1031280
24. W. Winkler. The State of Record Linkage and Current Research Problems. Technical report, U.S. Bureau of the Census, 1999.
25. GuoDong Zhou, Jian Su, Named entity recognition using an HMM-based chunk tagger, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 07-12, 2002, Philadelphia, Pennsylvania doi:10.3115/1073083.1073163

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2008 EntityCatOverLargeDocCollections	Venkatesh Ganti Arnd Christian Konig Rares Vernica			Entity Categorization Over Large Document Collections		Proceedings of KDD 2008	http://dx.doi.org/10.1145/1401890.1401927	10.1145/1401890.1401927		2008