2005 RobustIdentificationOfFuzzyDuplicates

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Record Coreference Resolution Algorithm

Notes

Cited By

~99 http://scholar.google.com/scholar?cites=13467241989349531787

Quotes

Abstract

Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.



References

  • (CGGM03) ⇒ Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani. (2003). “Robust and efficient fuzzy match for online data cleaning.” In: Proceedings of the 2003 [[ACM SIGMOD] Conference, June 09-12, 2003.
  • R. Ananthakrishna, S. Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of VLDB, Hong Kong, 2002.
  • Ricardo Baeza-Yates and Gonzalo Navarro. A practical index for text retrieval allowing errors. In R. Monge, editor, Proceedings of the XXIII Latin American Conference on Informatics (CLEI'97), Valparaiso, Chile, 1997.
  • Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
  • A. Broder, On the Resemblance and Containment of Documents, Proceedings of the Compression and Complexity of Sequences 1997, p.21, June 11-13, 1997
  • Paolo Ciaccia, Marco Patella, Pavel Zezula, M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, Proceedings of the 23rd International Conference on Very Large Data Bases, p.426-435, August 25-29, 1997
  • Edith Cohen, Size-estimation framework with applications to transitive closure and reachability, Journal of Computer and System Sciences, v.55 n.3, p.441-453, Dec. 1997 doi:10.1006/jcss.1997.1534
  • William W. Cohen, Integration of heterogeneous databases without common domains using queries based on textual similarity, Proceedings of the 1998 ACM SIGMOD Conference, p.201-212, June 01-04, 1998, Seattle, Washington, United States
  • William W. Cohen, Data integration using similarity joins and a word-based information representation language, ACM Transactions on Information Systems (TOIS), v.18 n.3, p.288-321, July 2000 doi:10.1145/352595.352598
  • Edith Cohen, David D. Lewis, Approximating Matrix Multiplication for Pattern Recognition Tasks, Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, p.682-691, January 05-07, 1997, New Orleans, Louisiana, United States
  • W. Cohen and J. Richman. Learning to match and cluster entity names. In: Proceedings of SIGKDD, Edmonton, July 2002.
  • Volker Gaede, Oliver Günther, Multidimensional access methods, ACM Computing Surveys (CSUR), v.30 n.2, p.170-231, June 1998 doi:10.1145/280277.280279
  • Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
  • Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States
  • Piotr Indyk, Rajeev Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.604-613, May 24-26, 1998, Dallas, Texas, United States doi:10.1145/276698.276876
  • 15. P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In A. Tarlecki, editor, Mathematical Foundations of Computer Science, 1991.
  • Rajeev Motwani, Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995
  • Gonzalo Navarro, Ricardo Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin, 24(4):19--27, 2001.
  • Gonzalo Navarro, Erkki Sutinen, Jani Tanninen, Jorma Tarhio, Indexing Text with Approximate q-Grams, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, p.350-363, June 21-23, 2000
  • Gonzalo Navarro, Searching in metric spaces by spatial approximation, The VLDB Journal — The International Journal on Very Large Data Bases, v.11 n.1, p.28-46, August 2002 doi:10.1007/s007780200060
  • Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775087
  • B. Schneier. Applied Cryptography John Wiley, 1996.
  • T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
  • Trillium Software. http://www.trilliumsoft.com
  • W. Winkler. The state of record linkage and current research problems. http://www.census.gov/srd/papers/pdf/rr99-04.pdf,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2005 RobustIdentificationOfFuzzyDuplicatesRajeev Motwani
Surajit Chaudhuri
Venkatesh Ganti
Robust Identification of Fuzzy DuplicatesProceedings of the 21st International Conference on Data Engineeringhttp://ieeexplore.ieee.org/xpl/freeabs all.jsp?arnumber=141019910.1109/ICDE.2005.1252005