2003 RobustEfficientFuzzyMatch

(Chaudhuri et al., 2003) ⇒ Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani. (2003). “Robust and Efficient Fuzzy Match for Online Data Cleaning.” In: Proceedings of the 2003 ACM SIGMOD Conference (SIGMOD 2003). doi:10.1145/872757.872796

Subject Headings: Entity Record Deduplication Algorithm, Data Cleansing Task

Notes

Cited By

~276 http://scholar.google.com/scholar?cites=1027889780569772575

(Chaudhuri et al., 2005) ⇒ Surajit Chaudhuri, Venkatesh Ganti, Rajeev Motwani. (2005). “Robust Identification of Fuzzy Duplicates.” Proceedings of the 21st International Conference on Data Engineering (ICDE'05), p.865-876, April 05-08, 2005 doi:10.1109/ICDE.2005.125
(Winkler, 2006) ⇒ William E. Winkler. (2006). “Overview of record linkage and current research directions.” Technical Report Statistical Research Report Series RRS2006/02, U.S. Bureau of the Census.
- Chaudhuri et al (2003) showed how to create index structures that allow for certain types of typographical error in matching within a database. In their application, their methods reduced computation by a factor of three in comparison with naïve methods that compare every pair of records in two files.

Quotes

Abstract

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation.A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.

References

R. Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of VLDB, Hong Kong, 2002.
Ricardo Baeza-Yates and Gonzalo Navarro. A practical index for text retrieval allowing errors. In R. Monge, editor, Proceedings of the XXIII Latin American Conference on Informatics (CLEI'97), Valparaiso, Chile, 1997.
Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
A. Broder, On the Resemblance and Containment of Documents, Proceedings of the Compression and Complexity of Sequences 1997, p.21, June 11-13, 1997
Paolo Ciaccia, Marco Patella, Pavel Zezula, M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, Proceedings of the 23rd International Conference on Very Large Data Bases, p.426-435, August 25-29, 1997
Edith Cohen, Size-estimation framework with applications to transitive closure and reachability, Journal of Computer and System Sciences, v.55 n.3, p.441-453, Dec. 1997 doi:10.1006/jcss.1997.1534 7 William W. Cohen, Integration of heterogeneous databases without common domains using queries based on textual similarity, Proceedings of the 1998 ACM SIGMOD Conference, p.201-212, June 01-04, 1998, Seattle, Washington, United States 8 William W. Cohen, Data integration using similarity joins and a word-based information representation language, ACM Transactions on Information Systems (TOIS), v.18 n.3, p.288-321, July 2000 doi:10.1145/352595.352598
Edith Cohen, David D. Lewis, Approximating Matrix Multiplication for Pattern Recognition Tasks, Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms, p.682-691, January 05-07, 1997, New Orleans, Louisiana, United States
10 W. Cohen and J. Richman. Learning to match and cluster entity names. In: Proceedings of SIGKDD, Edmonton, July (2002). 11 Volker Gaede, Oliver Günther, Multidimensional access methods, ACM Computing Surveys (CSUR), v.30 n.2, p.170-231, June 1998 doi:10.1145/280277.280279
Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001 13 Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States 14 Piotr Indyk, Rajeev Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.604-613, May 24-26, 1998, Dallas, Texas, United States doi:10.1145/276698.276876
P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In A. Tarlecki, editor, Mathematical Foundations of Computer Science, 1991.
Rajeev Motwani, Prabhakar Raghavan, Randomized algorithms, Cambridge University Press, New York, NY, 1995
Gonzalo Navarro, Ricardo Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin, 24(4):19--27, 2001.
Gonzalo Navarro, Erkki Sutinen, Jani Tanninen, Jorma Tarhio, Indexing Text with Approximate q-Grams, Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, p.350-363, June 21-23, 2000
Gonzalo Navarro, Searching in metric spaces by spatial approximation, The VLDB Journal — The International Journal on Very Large Data Bases, v.11 n.1, p.28-46, August 2002 doi:10.1007/s007780200060 20 Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775087
B. Schneier. Applied Cryptography John Wiley, 1996.
T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.
Trillium Software. http://www.trilliumsoft.com
W. Winkler. The state of record linkage and current research problems. http://www.census.gov/srd/papers/pdf/rr99-04.pdf

,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2003 RobustEfficientFuzzyMatch	Rajeev Motwani Surajit Chaudhuri Venkatesh Ganti Kris Ganjam			Robust and Efficient Fuzzy Match for Online Data Cleaning		Proceedings of the 2003 ACM SIGMOD Conference	http://research.microsoft.com/pubs/68895/fm sigmod03.pdf	10.1145/872757.872796		2003

2003 RobustEfficientFuzzyMatch

Notes

Cited By

Quotes

Abstract

References

Navigation menu

Search