2003 AdaptiveDuplicateDetection

Subject Heading(s): Duplicate Record Detection Algorithm.

Notes

Cited By

2007

2005

Quotes

Abstract

Introduction

  • Databases frequently contain field-values and records that refer to the same entity but are not syntactically identical. Variations in representation can arise from typographical errors, misspellings, abbreviations, as well as integration of multiple data sources. Variations are particularly pronounced in data that is automatically extracted from unstructured or semi-structured documents or web pages [16, 3]. Such approximate duplicates can have many deleterious effects, including preventing data-mining algorithms from discovering important regularities. This problem is typically handled during a tedious manual data cleaning, or “de-duping”, process. Some previous work has addressed the problem of identifying duplicate records, where it was referred to as record linkage [18, 25], the merge/purge problem [10], duplicate detection [15, 22], hardening soft databases [3], reference matching [13], and entity name clustering and matching [4]. Typically, standard string similarity metrics such as edit distance [9] or vector-space cosine similarity [1] are used to determine whether two values or records are alike enough to be duplicates. Some more recent work [4, 22, 23] has investigated the use of pairing functions that combine multiple standard metrics.
  • Rather than hand-tuning a distance metric for each field, we propose to use trainable similarity measures that can be learned from small corpora of labeled examples, and thus adapt to different domains. We present two such string similarity measures. The first one utilizes the Expectation-Maximization (EM) algorithm for estimating the parameters of a generative model based on string edit distance with affine gaps. The other string similarity measure employs a Support Vector Machine (SVM) [24] to obtain a similarity estimate based on the vector-space model of text. The character-based distance is best suited for shorter strings with minor variations, while the measure based on vector-space representation is more appropriate for fields that contain longer strings with more global variations.
  • Our overall system, MARLIN (Multiply Adaptive Record Linkage with INduction), employs a two-level learning approach. First, string similarity measures are trained for every database field so that they can provide accurate estimates of string distance between values for that field. Next, a final predicate for detecting duplicate records is learned from the similarity metrics applied to each of the individual fields; a minimal sketch of this two-level setup is given below. We again utilize Support Vector Machines for this task, and show that they outperform decision trees, the classifier used in prior work [23, 22]. We evaluate our approach on several real-world datasets containing duplicate records and show that MARLIN can lead to improved duplicate detection accuracy over traditional techniques.
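
The following is a minimal, hypothetical sketch of this two-level idea, not the authors' MARLIN implementation: a fixed TF-IDF character n-gram cosine similarity stands in for the learned field-level measures, the two-field schema and toy records are invented for illustration, and scikit-learn supplies the record-level SVM.

```python
# Illustrative sketch of a two-level duplicate detector (not the MARLIN code):
# level 1 scores each field of a record pair, level 2 classifies the pair.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

FIELDS = ["name", "address"]  # hypothetical schema

def field_similarities(rec_a, rec_b, vectorizers):
    """One similarity score per field -> feature vector for a record pair."""
    feats = []
    for f in FIELDS:
        va = vectorizers[f].transform([rec_a[f]])
        vb = vectorizers[f].transform([rec_b[f]])
        feats.append(cosine_similarity(va, vb)[0, 0])
    return feats

# Toy records; in MARLIN the field-level measures would themselves be trained.
records = [
    {"name": "Art's Deli", "address": "12224 Ventura Blvd."},
    {"name": "Art's Delicatessen", "address": "12224 Ventura Boulevard"},
    {"name": "Hotel Bel-Air", "address": "701 Stone Canyon Rd."},
]
# Level 1: one character n-gram vector-space model per field.
vectorizers = {
    f: TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit(
        [r[f] for r in records]
    )
    for f in FIELDS
}
# Level 2: an SVM over per-field similarities, trained on labeled pairs.
pairs = [(0, 1, 1), (0, 2, 0), (1, 2, 0)]  # (index, index, is_duplicate)
X = [field_similarities(records[i], records[j], vectorizers) for i, j, _ in pairs]
y = [label for _, _, label in pairs]
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([field_similarities(records[0], records[1], vectorizers)]))
```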

Experimental Evaluation

Datasets

  • Our experiments were conducted on six datasets. Restaurant is a database of 864 restaurant names and addresses containing 112 duplicates, obtained by integrating records from Fodor’s and Zagat’s guidebooks. Cora is a collection of 1295 distinct citations to 122 Computer Science research papers from the Cora Computer Science research paper search engine. The citations were segmented into multiple fields such as author, title, and venue by an information extraction system, resulting in some crossover noise between the fields. Reasoning, Face, Reinforcement and Constraint are single-field datasets containing unsegmented citations to computer science papers in the corresponding areas from the Citeseer scientific literature digital library. Reasoning contains 514 citation records that represent 196 unique papers, Face contains 349 citations to 242 papers, Reinforcement contains 406 citations to 148 papers, and Constraint contains 295 citations to 199 papers. Tables 1–3 contain sample duplicate records from the Restaurant, Cora, and Reasoning datasets.

5. Related Work

  • The problem of identifying duplicate records in databases was originally identified by Newcombe [18] as record linkage in the context of identifying medical records of the same individual from different time periods. Fellegi and Sunter [7] developed a formal theory for record linkage and offered statistical methods for estimating matching parameters and error rates. In more recent work in statistics, Winkler proposed using EM-based methods for obtaining optimal matching rules [25]. That work was highly specialized for the domain of census records and used hand-tuned similarity measures.
  • Hernández and Stolfo [10] developed the sorted neighborhood method for limiting the number of potential duplicate pairs that require distance computation, while McCallum et al. proposed the canopies clustering algorithm [13] for the task of matching scientific citations. Monge and Elkan developed the iterative merging algorithm based on the union-find data structure [15] and showed the advantages of using a string distance metric that allows gaps [14]. Cohen et al. [3] posed the duplicate detection task as an optimization problem, proved the NP-hardness of solving it optimally, and proposed a nearly linear algorithm for finding a local optimum using the union-find data structure.
  • In recent work, Cohen and Richman have proposed an adaptive framework for duplicate detection that combines multiple similarity metrics [4]. Sarawagi and Bhamidipaty [22] and Tejada et al. [23] developed systems that employ committee-based active learning methods for selecting record pairs that are informative for training the record-level classifier, which combines similarity estimates from multiple fields across different metrics. In all of these approaches, fixed-cost similarity metrics were used to compare database records. We have shown that learnable similarity measures can be combined with trainable record-level similarity, and that active learning techniques from prior work can be easily extended to include the distance measures that we proposed.

6. Future Work

  • We have proposed a general framework for using learnable string similarity measures in duplicate detection, and provided two algorithms for character-based and vector-space-based text distances. There are several directions in which this approach can be extended. The general classification-based framework for computing vector-space similarity can be improved by modifying the SVM training algorithm to avoid the overfitting issues that we encountered. The algorithm based on iterative decomposition of the quadratic optimization problem used by SVMlight needs to be extended to be robust to differences between the distributions of test and training data. This task is similar to transduction [24, 12] in that it requires using unlabeled test data in the learning process, but it departs fundamentally from transduction in that the unlabeled test data come from a different distribution. Alternatively, the task of learning vector-space similarity between pairs of strings can be formalized as a parameter estimation or optimization problem; investigating statistical or mathematical-programming methods that incorporate regularization to deal with this distribution mismatch is a promising avenue for improvement.
  • Another area for future work lies in generalizing edit distance to include macro-operators for inserting and deleting common substrings, e.g., deleting “Street” in address fields. The string distance model with gaps would be particularly useful for this task, since it would allow discovering useful deletion sequences by developing a stochastic model based on the gaps created when computing minimum-cost alignments; a sketch of the underlying affine-gap distance computation appears below. Substructure discovery methods [5] could also be used to identify useful edit operation sequences that include different edit operations.
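
As a concrete reference point for such extensions, here is a minimal sketch of edit distance with affine gaps (Gotoh's dynamic program), where a contiguous gap of length k costs gap_open + (k - 1) * gap_extend; the cost values are illustrative defaults, not the parameters that the paper learns with EM.

```python
import math

def affine_gap_distance(s, t, sub_cost=1.0, gap_open=1.0, gap_extend=0.25):
    """Edit distance with affine gaps: extending a gap is cheaper than
    opening one, so deleting a whole substring like "Street" costs far
    less than per-character deletion. Costs are illustrative defaults."""
    n, m = len(s), len(t)
    INF = math.inf
    # M: last characters aligned; X: gap in t (s char deleted); Y: gap in s.
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + match
            X[i][j] = min(M[i - 1][j] + gap_open, X[i - 1][j] + gap_extend)
            Y[i][j] = min(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_extend)
    return min(M[n][m], X[n][m], Y[n][m])

# Deleting the 4-character tail "reet" costs 1 + 3 * 0.25 = 1.75, not 4.
print(affine_gap_distance("4th Street", "4th St"))
```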

7. Conclusions

  • Duplicate detection is an important problem in data cleaning, and an adaptive approach that learns to identify duplicate records for a specific domain has clear advantages over static methods. Experimental results demonstrate that trainable similarity measures are capable of learning the notion of similarity appropriate to a particular domain. We presented two learnable distance measures that improve over character-based and vector-space-based metrics and allow specializing them to specific datasets using labeled examples. We have also shown that support vector machines can be effectively utilized on some datasets for both string similarity and record similarity computations, outperforming traditional methods; we hope to improve on these initial results in future work. Our overall framework for duplicate detection integrates previous work on adaptive methods with learnable similarity measures, leading to improved results.

References

  • Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1999
  • M. Bilenko and Raymond Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX, Feb. 2002.
  • William W. Cohen, Henry Kautz, David McAllester, Hardening soft information sources, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.255-259, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347141
  • William W. Cohen, Jacob Richman, Learning to match and cluster large high-dimensional data sets for data integration, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775116
  • D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231--255, 1994.
  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
  • I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
  • Yoav Freund, Llew Mason, The Alternating Decision Tree Learning Algorithm, Proceedings of the Sixteenth International Conference on Machine Learning, p.124-133, June 27-30, 1999
  • Dan Gusfield, Algorithms on strings, trees, and sequences: computer science and computational biology, Cambridge University Press, New York, NY, 1997
  • Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States
  • Thorsten Joachims, Making large-scale support vector machine learning practical, Advances in kernel methods: support vector learning, MIT Press, Cambridge, MA, 1999
  • Thorsten Joachims, Transductive inference for text classification using support vector machines, Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), June 27-30, 1999
  • Andrew McCallum, Kamal Nigam, Lyle H. Ungar, Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347123
  • Alvaro E. Monge and C. Elkan. The field matching problem: Algorithms and applications. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 267--270, Portland, OR, Aug. 1996.
  • Alvaro E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23--29, Tucson, AZ, May 1997.
  • U. Y. Nahm and Raymond Mooney. Using information extraction to aid the discovery of prediction rules from texts. In: Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, Boston, MA, Aug. 2000.
  • S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443--453, 1970.
  • H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954--959, 1959.
  • J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. Bartlett, Bernhard Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 185--208. MIT Press, 1999.
  • Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257--286, 1989.
  • Eric Sven Ristad, Peter N. Yianilos, Learning String-Edit Distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.20 n.5, p.522-532, May 1998 doi:10.1109/34.682181
  • Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775087
  • Sheila Tejada, Craig A. Knoblock, Steven Minton, Learning domain-independent string transformation weights for high accuracy object identification, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775099
  • Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.
  • W. E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC, 1999.
  • Ian H. Witten, Eibe Frank, Data mining: Practical Machine Learning Tools and Techniques with Java implementations, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2000
  • Bianca Zadrozny, Charles Elkan, Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, Proceedings of the Eighteenth International Conference on Machine Learning, p.609-616, June 28 - July 1, 2001


 Author(s): Mikhail Bilenko, Raymond J. Mooney
 Title: Adaptive Duplicate Detection Using Learnable String Similarity Measures
 Venue: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003)
 URL: http://www.cs.utexas.edu/users/mbilenko/papers/03-marlin-kdd.pdf
 DOI: 10.1145/956750.956759
 Year: 2003