2003 AdaptiveDuplicateDetection

Subject Heading(s): Duplicate Record Detection Algorithm.

Notes

Cited By

2007

2005

Quotes

Abstract

Introduction

  • Databases frequently contain field-values and records that refer to the same entity but are not syntactically identical. Variations in representation can arise from typographical errors, misspellings, abbreviations, as well as integration of multiple data sources. Variations are particularly pronounced in data that is automatically extracted from unstructured or semi-structured documents or web pages [16, 3]. Such approximate duplicates can have many deleterious effects, including preventing data-mining algorithms from discovering important regularities. This problem is typically handled during a tedious manual data cleaning, or “de-duping”, process. Some previous work has addressed the problem of identifying duplicate records, where it was referred to as record linkage [18, 25], the merge/purge problem [10], duplicate detection [15, 22], hardening soft databases [3], reference matching [13], and entity name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [9] or vector-space cosine similarity [1] are used to determine whether two values or records are alike enough to be duplicates. Some more recent work [4, 22, 23] has investigated the use of pairing functions that combine multiple standard metrics.
  • Rather than hand-tuning a distance metric for each field, we propose to use trainable similarity measures that can be learned from small corpora of labeled examples, and thus adapt to different domains. We present two such string similarity measures. The first one utilizes the Expectation-Maximization (EM) algorithm for estimating the parameters of a generative model based on string edit distance with affine gaps. The other string similarity measure employs a Support Vector Machine (SVM) [24] to obtain a similarity estimate based on the vector-space model of text. The character-based distance is best suited for shorter strings with minor variations, while the measure based on vector-space representation is more appropriate for fields that contain longer strings with more global variations.
  • Our overall system, MARLIN (Multiply Adaptive Record Linkage with INduction), employs a two-level learning approach. First, string similarity measures are trained for every database field so that they can provide accurate estimates of string distance between values for that field. Next, a final predicate for detecting duplicate records is learned from the similarity metrics applied to each of the individual fields; a minimal sketch of this two-level setup is given below. We again utilize Support Vector Machines for this task, and show that they outperform decision trees, the classifier used in prior work [23, 22]. We evaluate our approach on several real-world datasets containing duplicate records and show that MARLIN can lead to improved duplicate detection accuracy over traditional techniques.
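
The following is a minimal, hypothetical sketch of this two-level idea, not the authors' MARLIN implementation: a fixed TF-IDF character n-gram cosine similarity stands in for the learned field-level measures, the two-field schema and toy records are invented for illustration, and scikit-learn supplies the record-level SVM.

```python
# Illustrative sketch of a two-level duplicate detector (not the MARLIN code):
# level 1 scores each field of a record pair, level 2 classifies the pair.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

FIELDS = ["name", "address"]  # hypothetical schema

def field_similarities(rec_a, rec_b, vectorizers):
    """One similarity score per field -> feature vector for a record pair."""
    feats = []
    for f in FIELDS:
        va = vectorizers[f].transform([rec_a[f]])
        vb = vectorizers[f].transform([rec_b[f]])
        feats.append(cosine_similarity(va, vb)[0, 0])
    return feats

# Toy records; in MARLIN the field-level measures would themselves be trained.
records = [
    {"name": "Art's Deli", "address": "12224 Ventura Blvd."},
    {"name": "Art's Delicatessen", "address": "12224 Ventura Boulevard"},
    {"name": "Hotel Bel-Air", "address": "701 Stone Canyon Rd."},
]
# Level 1: one character n-gram vector-space model per field.
vectorizers = {
    f: TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit(
        [r[f] for r in records]
    )
    for f in FIELDS
}
# Level 2: an SVM over per-field similarities, trained on labeled pairs.
pairs = [(0, 1, 1), (0, 2, 0), (1, 2, 0)]  # (index, index, is_duplicate)
X = [field_similarities(records[i], records[j], vectorizers) for i, j, _ in pairs]
y = [label for _, _, label in pairs]
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([field_similarities(records[0], records[1], vectorizers)]))
```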

Experimental Evaluation

Datasets

  • Our experiments were conducted on six datasets. Restaurant is a database of 864 restaurant names and addresses containing 112 duplicates, obtained by integrating records from Fodor’s and Zagat’s guidebooks. Cora is a collection of 1295 distinct citations to 122 Computer Science research papers from the Cora Computer Science research paper search engine. The citations were segmented into multiple fields such as author, title, and venue by an information extraction system, resulting in some crossover noise between the fields. Reasoning, Face, Reinforcement and Constraint are single-field datasets containing unsegmented citations to computer science papers in the corresponding areas from the Citeseer scientific literature digital library. Reasoning contains 514 citation records that represent 196 unique papers, Face contains 349 citations to 242 papers, Reinforcement contains 406 citations to 148 papers, and Constraint contains 295 citations to 199 papers. Tables 1–3 contain sample duplicate records from the Restaurant, Cora, and Reasoning datasets.

5. Related Work

  • The problem of identifying duplicate records in databases was originally identified by Newcombe [18] as record linkage in the context of identifying medical records of the same individual from different time periods. Fellegi and Sunter [7] developed a formal theory for record linkage and offered statistical methods for estimating matching parameters and error rates. In more recent work in statistics, Winkler proposed using EM-based methods for obtaining optimal matching rules [25]. That work was highly specialized for the domain of census records and used hand-tuned similarity measures.
  • Hernández and Stolfo [10] developed the sorted neighborhood method for limiting the number of potential duplicate pairs that require distance computation, while McCallum et al. proposed the canopies clustering algorithm [13] for the task of matching scientific citations. Monge and Elkan developed the iterative merging algorithm based on the union-find data structure [15] and showed the advantages of using a string distance metric that allows gaps [14]. Cohen et al. [3] posed the duplicate detection task as an optimization problem, proved the NP-hardness of solving it optimally, and proposed a nearly linear algorithm for finding a local optimum using the union-find data structure.
  • In recent work, Cohen and Richman have proposed an adaptive framework for duplicate detection that combines multiple similarity metrics [4]. Sarawagi and Bhamidipaty [22] and Tejada et al. [23] developed systems that employ committee-based active learning methods for selecting record pairs that are informative for training the record-level classifier, which combines similarity estimates from multiple fields across different metrics. In all of these approaches, fixed-cost similarity metrics were used to compare database records. We have shown that learnable similarity measures can be combined with trainable record-level similarity, and that active learning techniques from prior work can be easily extended to include the distance measures that we proposed.

6. Future Work

  • We have proposed a general framework for using learnable string similarity measures in duplicate detection, and provided two algorithms for character-based and vector-space-based text distances. There are several directions in which this approach can be extended. The general classification-based framework for computing vector-space similarity can be improved by modifying the SVM training algorithm to avoid the overfitting issues that we encountered. The algorithm based on iterative decomposition of the quadratic optimization problem used by SVMlight needs to be extended to be robust to differences between the distributions of test and training data. This task is similar to transduction [24, 12] in that it requires using unlabeled test data in the learning process, but it departs fundamentally from transduction in that the unlabeled test data come from a different distribution. Alternatively, the task of learning vector-space similarity between pairs of strings can be formalized as a parameter estimation or optimization problem; investigating statistical or mathematical-programming methods that incorporate regularization to deal with this distribution mismatch is a promising avenue for improvement.
  • Another area for future work lies in generalizing edit distance to include macro-operators for inserting and deleting common substrings, e.g., deleting “Street” in address fields. The string distance model with gaps would be particularly useful for this task, since it would allow discovering useful deletion sequences by developing a stochastic model based on the gaps created when computing minimum-cost alignments; a sketch of the underlying affine-gap distance computation appears below. Substructure discovery methods [5] could also be used to identify useful edit operation sequences that include different edit operations.
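
As a concrete reference point for such extensions, here is a minimal sketch of edit distance with affine gaps (Gotoh's dynamic program), where a contiguous gap of length k costs gap_open + (k - 1) * gap_extend; the cost values are illustrative defaults, not the parameters that the paper learns with EM.

```python
import math

def affine_gap_distance(s, t, sub_cost=1.0, gap_open=1.0, gap_extend=0.25):
    """Edit distance with affine gaps: extending a gap is cheaper than
    opening one, so deleting a whole substring like "Street" costs far
    less than per-character deletion. Costs are illustrative defaults."""
    n, m = len(s), len(t)
    INF = math.inf
    # M: last characters aligned; X: gap in t (s char deleted); Y: gap in s.
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + match
            X[i][j] = min(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend)
            Y[i][j] = min(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])

# Deleting the 4-character tail "reet" costs 1 + 3 * 0.25 = 1.75, not 4.
print(affine_gap_distance("4th Street", "4th St"))
```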

7. Conclusions

  • Duplicate detection is an important problem in data cleaning, and an adaptive approach that learns to identify duplicate records for a specific domain has clear advantages over static methods. Experimental results demonstrate that trainable similarity measures are capable of learning the notion of similarity appropriate to a particular domain. We presented two learnable distance measures that improve over character-based and vector-space-based metrics and allow specializing them to specific datasets using labeled examples. We have also shown that support vector machines can be effectively utilized on some datasets for both string similarity and record similarity computations, outperforming traditional methods; we hope to improve on these initial results in future work. Our overall framework for duplicate detection integrates previous work on adaptive methods with learnable similarity measures, leading to improved results.

References

  • Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
  • M. Bilenko and Raymond Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX, Feb. 2002.
  • William W. Cohen, Henry Kautz, David McAllester, Hardening soft information sources, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.255-259, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347141
  • William W. Cohen, Jacob Richman, Learning to match and cluster large high-dimensional data sets for data integration, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775116
  • D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231--255, 1994.
  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
  • I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
  • Yoav Freund, Llew Mason, The Alternating Decision Tree Learning Algorithm, Proceedings of the Sixteenth International Conference on Machine Learning, p.124-133, June 27-30, 1999
  • Dan Gusfield, Algorithms on strings, trees, and sequences: computer science and computational biology, Cambridge University Press, New York, NY, 1997
  • Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States
  • Thorsten Joachims, Making large-scale support vector machine learning practical, Advances in kernel methods: support vector learning, MIT Press, Cambridge, MA, 1999
  • Thorsten Joachims, Transductive inference for text classification using support vector machines, Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), June 27-30, 1999
  • Andrew McCallum, Kamal Nigam, Lyle H. Ungar, Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347123
  • Alvaro E. Monge and C. Elkan. The field matching problem: Algorithms and applications. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 267--270, Portland, OR, Aug. 1996.
  • Alvaro E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23--29, Tucson, AZ, May 1997.
  • U. Y. Nahm and Raymond Mooney. Using information extraction to aid the discovery of prediction rules from texts. In: Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, Boston, MA, Aug. 2000.
  • S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443--453, 1970.
  • H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954--959, 1959.
  • J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. Bartlett, Bernhard Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 185--208. MIT Press, 1999.
  • Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257--286, 1989.
  • Eric Sven Ristad, Peter N. Yianilos, Learning String-Edit Distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.20 n.5, p.522-532, May 1998 doi:10.1109/34.682181
  • Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775087
  • Sheila Tejada, Craig A. Knoblock, Steven Minton, Learning domain-independent string transformation weights for high accuracy object identification, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775099
  • Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.
  • W. E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
  • Ian H. Witten, Eibe Frank, Data mining: Practical Machine Learning Tools and Techniques with Java implementations, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000
  • Bianca Zadrozny, Charles Elkan, Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, Proceedings of the Eighteenth International Conference on Machine Learning, p.609-616, June 28 - July 1, 2001


 Author(s): Mikhail Bilenko, Raymond J. Mooney
 Title: Adaptive Duplicate Detection Using Learnable String Similarity Measures
 Venue: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003)
 URL: http://www.cs.utexas.edu/users/mbilenko/papers/03-marlin-kdd.pdf
 DOI: 10.1145/956750.956759
 Year: 2003