2003 RobustEfficientFuzzyMatch

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Entity Record Deduplication Algorithm, Data Cleansing Task

Notes

Cited By

~276 http://scholar.google.com/scholar?cites=1027889780569772575

Quotes

Abstract

  • To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation.A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.

References

  • R. Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of VLDB, Hong Kong, 2002.
  • Ricardo Baeza-Yates and Gonzalo Navarro. A practical index for text retrieval allowing errors. In R. Monge, editor, Proceedings of the XXIII Latin American Conference on Informatics (CLEI'97), Valparaiso, Chile, 1997.
  • Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
  • A. Broder, On the Resemblance and Containment of Documents, Proceedings of the Compression and Complexity of Sequences 1997, p.21, June 11-13, 1997
  • Paolo Ciaccia, Marco Patella, Pavel Zezula, M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, Proceedings of the 23rd International Conference on Very Large Data Bases, p.426-435, August 25-29, 1997
  • Edith Cohen, Size-estimation framework with applications to transitive closure and reachability, Journal of Computer and System Sciences, v.55 n.3, p.441-453, Dec. 1997 doi:10.1006/jcss.1997.1534 7 William W. Cohen, Integration of heterogeneous databases without common domains using queries based on textual similarity, Proceedings of the 1998 ACM SIGMOD Conference, p.201-212, June 01-04, 1998, Seattle, Washington, United States 8 William W. Cohen, Data integration using similarity joins and a word-based information representation language, ACM Transactions on Information Systems (TOIS), v.18 n.3, p.288-321, July 2000 doi:10.1145/352595.352598
  • Edith Cohen, David D. Lewis, Approximating Matrix Multiplication for Pattern Recognition Tasks, Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, p.682-691, January 05-07, 1997, New Orleans, Louisiana, United States
  • 10 W. Cohen and J. Richman. Learning to match and cluster entity names. In: Proceedings of SIGKDD, Edmonton, July (2002). 11 Volker Gaede, Oliver Günther, Multidimensional access methods, ACM Computing Surveys (CSUR), v.30 n.2, p.170-231, June 1998 doi:10.1145/280277.280279
  • Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001 13 Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States 14 Piotr Indyk, Rajeev Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.604-613, May 24-26, 1998, Dallas, Texas, United States doi:10.1145/276698.276876
  • P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In A. Tarlecki, editor, Mathematical Foundations of Computer Science, 1991.
  • Rajeev Motwani, Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995
  • Gonzalo Navarro, Ricardo Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin, 24(4):19--27, 2001.
  • Gonzalo Navarro, Erkki Sutinen, Jani Tanninen, Jorma Tarhio, Indexing Text with Approximate q-Grams, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, p.350-363, June 21-23, 2000
  • Gonzalo Navarro, Searching in metric spaces by spatial approximation, The VLDB Journal — The International Journal on Very Large Data Bases, v.11 n.1, p.28-46, August 2002 doi:10.1007/s007780200060 20 Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775087
  • B. Schneier. Applied Cryptography John Wiley, 1996.
  • T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
  • Trillium Software. http://www.trilliumsoft.com
  • W. Winkler. The state of record linkage and current research problems. http://www.census.gov/srd/papers/pdf/rr99-04.pdf,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2003 RobustEfficientFuzzyMatchRajeev Motwani
Surajit Chaudhuri
Venkatesh Ganti
Kris Ganjam
Robust and Efficient Fuzzy Match for Online Data CleaningProceedings of the 2003 ACM SIGMOD Conferencehttp://research.microsoft.com/pubs/68895/fm sigmod03.pdf10.1145/872757.8727962003