2007 CollectiveERinRelData


Subject Headings: Supervised Record Coreference Resolution Algorithm, Relational Record Coreference Resolution Algorithm, Relational Similarity Function.

Notes

Cited By

Quotes

Abstract

Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may cooccur. In these cases, collective entity resolution, in which entities for cooccurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
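The idea of combining attribute and relational evidence can be illustrated with a small sketch. This is not the authors' implementation: the linear weighting by `alpha`, the bigram-based name similarity, and the use of a Jaccard coefficient over co-occurring entity clusters are all assumptions chosen for illustration, and the toy names and cluster labels are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bigrams(name):
    """Character bigrams of a name, ignoring case, dots and spaces."""
    s = name.lower().replace(".", "").replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}


def combined_similarity(name_i, name_j, nbrs_i, nbrs_j, alpha=0.5):
    """Weighted mix of attribute and relational similarity (assumed form).

    nbrs_i and nbrs_j are the sets of entity clusters that currently
    co-occur with each reference, e.g. the clusters of its co-authors.
    """
    sim_attr = jaccard(bigrams(name_i), bigrams(name_j))
    sim_rel = jaccard(nbrs_i, nbrs_j)
    return (1 - alpha) * sim_attr + alpha * sim_rel


# Hypothetical toy data: three "W. Wang" references and the clusters
# their co-authors have been assigned to so far.
same_circle = combined_similarity("W. Wang", "W. Wang",
                                  {"ansari", "chen"}, {"ansari"})
different_circle = combined_similarity("W. Wang", "W. Wang",
                                       {"ansari", "chen"}, {"li"})

print(same_circle, different_circle)
```

The names are identical in both comparisons, so an attribute-only measure cannot tell the two cases apart. The relational term raises the score only for the pair that shares a co-author cluster. Because that term depends on how the co-occurring references are themselves resolved, the decisions have to be made jointly rather than independently.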


References

  • Adamic, L. and Adar, E. (2003). Friends and neighbors on the Web. Social Networks 25, 3 (July), 211--230.
  • Ananthakrishna, R., Chaudhuri, S., and Ganti, V. (2002). Eliminating fuzzy duplicates in data warehouses. In The International Conference on Very Large Databases (VLDB). Hong Kong, China.
  • Benjelloun, O., Garcia-Molina, H., Su, Q., and Widom, J. (2005). Swoosh: A generic approach to entity resolution. Tech. rep., Stanford University. (March).
  • Indrajit Bhattacharya, Lise Getoor, Iterative record linkage for cleaning and integration, Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, June 13, 2004, Paris, France doi:10.1145/1008694.1008697
  • Bhattacharya, I. and Getoor, L. 2006a. Entity resolution in graphs. In Mining Graph Data. L. Holder and D. Cook, Eds. John Wiley.
  • Bhattacharya, I. and Getoor, L. 2006b. A latent Dirichlet model for unsupervised entity resolution. In The SIAM Conference on Data Mining (SIAM-SDM). Bethesda, MD.
  • Indrajit Bhattacharya, Lise Getoor, Louis Licamele, Query-time entity resolution, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 20-23, 2006, Philadelphia, PA, USA doi:10.1145/1150402.1150463.
  • (Bilenko and Mooney, 2003) ⇒ Mikhail Bilenko, and Raymond Mooney. (2003). “Adaptive Duplicate Detection Using Learnable String Similarity Measures.” In: Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003).
  • Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, Stephen Fienberg, Adaptive Name Matching in Information Integration, IEEE Intelligent Systems, v.18 n.5, p.16-23, September 2003 doi:10.1109/MIS.2003.1234765.
  • Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani, Robust and efficient fuzzy match for online data cleaning, Proceedings of the 2003 ACM SIGMOD Conference, June 09-12, 2003, San Diego, California doi:10.1145/872757.872796.
  • William W. Cohen, Data integration using similarity joins and a word-based information representation language, ACM Transactions on Information Systems (TOIS), v.18 n.3, p.288-321, July 2000 doi:10.1145/352595.352598
  • William W. Cohen, Ravikumar, P., and Fienberg, S. (2003). A comparison of string distance metrics for name-matching tasks. In The IJCAI Workshop on Information Integration on the Web (IIWeb). Acapulco, Mexico.
  • William W. Cohen, Jacob Richman, Learning to match and cluster large high-dimensional data sets for data integration, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775116.
  • Xin Dong, Alon Halevy, Jayant Madhavan, Reference reconciliation in complex information spaces, Proceedings of the 2005 ACM SIGMOD Conference, June 14-16, 2005, Baltimore, Maryland doi:10.1145/1066157.1066168
  • Fellegi, I. and Sunter, A. 1969. A theory for record linkage. J. Amer. Statis. Assoc. 64, 1183--1210.
  • C. Lee Giles, Kurt D. Bollacker, Steve Lawrence, CiteSeer: an automatic citation indexing system, Proceedings of the third ACM conference on Digital libraries, p.89-98, June 23-26, 1998, Pittsburgh, Pennsylvania, United States doi:10.1145/276675.276685
  • Gravano, L., Ipeirotis, P., Koudas, N., and Srivastava, D. (2003). Text joins for data cleansing and integration in an RDBMS. In The IEEE International Conference on Data Engineering (ICDE). Bangalore, India.
  • (Hernández and Stolfo, 1995) ⇒ Mauricio A. Hernández, Salvatore J. Stolfo, (1995). “The Merge/Purge Problem for Large Databases.” In: Proceedings of ACM SIGMOD (1995).
  • (Kalashnikov et al., 2005) ⇒ Dmitri V. Kalashnikov, Sharad Mehrotra, and Zhaoqi Chen. (2005). “Exploiting Relationships for Domain-Independent Data Cleaning.” In: Proceedings of the SIAM International Conference on Data Mining (SIAM SDM 2005)
  • Xin Li, Paul Morie, Dan Roth, Semantic integration in text: from ambiguous names to identifiable entities, AI Magazine, v.26 n.1, p.45-58, March (2005).
  • David Liben-Nowell, Jon Kleinberg, The link prediction problem for social networks, Proceedings of the twelfth International Conference on Information and knowledge management, November 03-08, 2003, New Orleans, LA, USA doi:10.1145/956863.956972.
  • Andrew McCallum, Kamal Nigam, Lyle H. Ungar, Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347123
  • Andrew McCallum and Wellner, B. (2004). Conditional models of identity uncertainty with application to noun coreference. In The Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.
  • Monge, A. and Elkan, C. (1996). The field matching problem: Algorithms and applications. In The International Conference on Knowledge Discovery and Data Mining (SIGKDD). Portland, OR.
  • Monge, A. and Elkan, C. (1997). An efficient domain-independent algorithm for detecting approximately duplicate database records. In The SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD). Tucson, AZ.
  • Gonzalo Navarro, A guided tour to approximate string matching, ACM Computing Surveys (CSUR), v.33 n.1, p.31-88, March 2001 doi:10.1145/375360.375365
  • Newcombe, H., Kennedy, J., Axford, S., and James, A. 1959. Automatic linkage of vital records. Science 130, 954--959.
  • Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2003). Identity uncertainty and citation matching. In The Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, Canada.
  • Pradeep Ravikumar, William W. Cohen, A hierarchical graphical model for record linkage, Proceedings of the 20th conference on Uncertainty in artificial intelligence, p.454-461, July 07-11, 2004, Banff, Canada
  • Eric Sven Ristad, Peter N. Yianilos, Learning String-Edit Distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.20 n.5, p.522-532, May 1998 doi:10.1109/34.682181.
  • Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775087
  • Singla, P. and Pedro Domingos (2004). Multi-relational record linkage. In The ACM SIGKDD Workshop on Multi-Relational Data Mining (MRDM). Seattle, WA.
  • Sheila Tejada, Craig A. Knoblock, Steven Minton, Learning object identification rules for information integration, Information Systems, v.26 n.8, p.607-633, December 2001 doi:10.1016/S0306-4379(01)00042-4
  • Winkler, W. (1999). The state of record linkage and current research problems. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.
  • Winkler, W. (2002). Methods for record linkage and Bayesian networks. Tech. rep., Statistical Research Division, U.S. Census Bureau, Washington, DC.


2007 CollectiveERinRelData
Author: Indrajit Bhattacharya, Lise Getoor
Title: Collective Entity Resolution In Relational Data
Journal: ACM Transactions on Knowledge Discovery from Data
URL: http://linqs.cs.umd.edu/basilic/web/Publications/2007/bhattacharya:tkdd07/bhattacharya-tkdd.pdf
DOI: 10.1145/1217299.1217304
Year: 2007