2006 EffiLinkingTextDocs

Subject Headings: Entity Mention Normalization Algorithm, TF-IDF Ranking Function


Faced with growing knowledge management needs, enterprises are increasingly realizing the importance of interlinking critical business information distributed across structured and unstructured data sources. We present a novel system, called EROCS, for linking a given text document with relevant structured data. EROCS views the structured data as a predefined set of "entities" and identifies the entities that best match the given document. EROCS also embeds the identified entities in the document, effectively creating links between the structured data and segments within the document. Unlike prior approaches, EROCS identifies such links even when the relevant entity is not explicitly mentioned in the document. EROCS uses an efficient algorithm that performs this task keeping the amount of information retrieved from the database at a minimum. Our evaluation shows that EROCS achieves high accuracy with reasonable overheads.
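The matching idea in the abstract (treat the structured data as a set of entities, then pick the entity that best matches a document) can be illustrated with a minimal sketch. This is not the EROCS algorithm: the toy entities, the `idf_weights`, `score`, and `best_entity` helpers, and the simple TF-IDF-style weighting are all assumptions made for illustration, and the sketch scans every entity rather than limiting what is retrieved from the database.

```python
import math
from collections import Counter

# Hypothetical toy data: each "entity" is a bag of terms taken from its
# structured record. The identifiers and terms are invented for illustration.
ENTITIES = {
    "order-1001": ["alice", "smith", "laptop", "boston"],
    "order-1002": ["bob", "jones", "printer", "boston"],
    "order-1003": ["carol", "lee", "laptop", "denver"],
}

def idf_weights(entities):
    """Weight each term by how rare it is across the entity set."""
    n = len(entities)
    df = Counter(term for terms in entities.values() for term in set(terms))
    return {term: math.log(1 + n / count) for term, count in df.items()}

def score(document_terms, entity_terms, idf):
    """Sum tf * idf over the terms shared by the document and the entity."""
    tf = Counter(document_terms)
    return sum(tf[t] * idf[t] for t in set(entity_terms) if t in tf)

def best_entity(text, entities):
    """Return the highest-scoring entity id and the full score table."""
    idf = idf_weights(entities)
    doc_terms = text.lower().split()
    scores = {eid: score(doc_terms, terms, idf) for eid, terms in entities.items()}
    return max(scores, key=scores.get), scores

best, scores = best_entity(
    "Customer in Denver reports the laptop will not boot", ENTITIES
)
print(best)
```

In this toy run the document never mentions an order identifier, yet the terms it shares with one record (the product and the city) are enough to rank that record first, which loosely mirrors the abstract's point about linking entities that are not explicitly named.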




2006 EffiLinkingTextDocs
Author: Venkatesan T. Chakaravarthy, Himanshu Gupta, Prasan Roy, Mukesh Mohania
Title: Efficiently Linking Text Documents with Relevant Structured Information
Journal: Proceedings of the 32nd International Conference on Very Large Data Bases
Title URL: http://www.vldb.org/conf/2006/p667-chakaravarthy.pdf
Year: 2006