2005 RobustIdentificationOfFuzzyDuplicates

(Chaudhuri et al., 2005) ⇒ Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani. (2005). “Robust Identification of Fuzzy Duplicates.” In: Proceedings of the 21st International Conference on Data Engineering (ICDE 2005). doi:10.1109/ICDE.2005.125

Subject Headings: Record Coreference Resolution Algorithm

Notes

Cited By

~99 http://scholar.google.com/scholar?cites=13467241989349531787

Quotes

Abstract

Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.

References

(CGGM03) ⇒ Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani. (2003). “Robust and efficient fuzzy match for online data cleaning.” In: Proceedings of the 2003 [[ACM SIGMOD] Conference, June 09-12, 2003.
R. Ananthakrishna, S. Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of VLDB, Hong Kong, 2002.
Ricardo Baeza-Yates and Gonzalo Navarro. A practical index for text retrieval allowing errors. In R. Monge, editor, Proceedings of the XXIII Latin American Conference on Informatics (CLEI'97), Valparaiso, Chile, 1997.
Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
A. Broder, On the Resemblance and Containment of Documents, Proceedings of the Compression and Complexity of Sequences 1997, p.21, June 11-13, 1997
Paolo Ciaccia, Marco Patella, Pavel Zezula, M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, Proceedings of the 23rd International Conference on Very Large Data Bases, p.426-435, August 25-29, 1997
Edith Cohen, Size-estimation framework with applications to transitive closure and reachability, Journal of Computer and System Sciences, v.55 n.3, p.441-453, Dec. 1997 doi:10.1006/jcss.1997.1534
William W. Cohen, Integration of heterogeneous databases without common domains using queries based on textual similarity, Proceedings of the 1998 ACM SIGMOD Conference, p.201-212, June 01-04, 1998, Seattle, Washington, United States
William W. Cohen, Data integration using similarity joins and a word-based information representation language, ACM Transactions on Information Systems (TOIS), v.18 n.3, p.288-321, July 2000 doi:10.1145/352595.352598
Edith Cohen, David D. Lewis, Approximating Matrix Multiplication for Pattern Recognition Tasks, Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, p.682-691, January 05-07, 1997, New Orleans, Louisiana, United States
W. Cohen and J. Richman. Learning to match and cluster entity names. In: Proceedings of SIGKDD, Edmonton, July 2002.
Volker Gaede, Oliver Günther, Multidimensional access methods, ACM Computing Surveys (CSUR), v.30 n.2, p.170-231, June 1998 doi:10.1145/280277.280279
Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States
Piotr Indyk, Rajeev Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.604-613, May 24-26, 1998, Dallas, Texas, United States doi:10.1145/276698.276876
15. P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In A. Tarlecki, editor, Mathematical Foundations of Computer Science, 1991.
Rajeev Motwani, Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995
Gonzalo Navarro, Ricardo Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin, 24(4):19--27, 2001.
Gonzalo Navarro, Erkki Sutinen, Jani Tanninen, Jorma Tarhio, Indexing Text with Approximate q-Grams, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, p.350-363, June 21-23, 2000
Gonzalo Navarro, Searching in metric spaces by spatial approximation, The VLDB Journal — The International Journal on Very Large Data Bases, v.11 n.1, p.28-46, August 2002 doi:10.1007/s007780200060
Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775087
B. Schneier. Applied Cryptography John Wiley, 1996.
T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
Trillium Software. http://www.trilliumsoft.com
W. Winkler. The state of record linkage and current research problems. http://www.census.gov/srd/papers/pdf/rr99-04.pdf

,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2005 RobustIdentificationOfFuzzyDuplicates	Rajeev Motwani Surajit Chaudhuri Venkatesh Ganti			Robust Identification of Fuzzy Duplicates		Proceedings of the 21st International Conference on Data Engineering	http://ieeexplore.ieee.org/xpl/freeabs all.jsp?arnumber=1410199	10.1109/ICDE.2005.125		2005