2002 InteractiveDeduplication

From GM-RKB
Jump to navigation Jump to search

Keyword(s): Entity Record Deduplication Algorithm, Citation Record Deduplication Task, Record Deduplication Function, Active Learning Algorithm.

Notes

Cited By

Quotes

Abstract

Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.

Introduction

Our contribution
We designed a learning based deduplication system (ALIAS - Active Learning led Interactive Alias Suppression) that allows automatic construction of the deduplication function by using a novel method of interactively discovering challenging training pairs. Our key insight is to simultaneously build several redundant functions and exploit the disagreement amongst them to discover new kinds inconsistencies amongst duplicates in the dataset. Active learning [6, 8] methods also rely on a similar insight for selecting instances for labeling from a large pool of unlabeled instances. Unlike an ordinary learner that trains using a static training set, an active learner actively picks subsets of instances which when labeled will provide the highest information gain to the learner.

With this approach the more difficult task of bringing together the potentially confusing record pair s is automated by the learner. The user has to only perform the easy task of labeling the selected pairs of records as duplicate or not.

We designed an active learning algorithm that can meet our design goals of interactive response, fast convergence, and high accuracy. Finally, our system outputs a deduplication function that is easy to interpret and efficient to evaluate when deployed on large record lists. This required evaluating various non-obvious design tradeoffs that arise when using current active learning methods in a practical setting. Experiments on real-life datasets show that our approach reduces the number of labeled training pairs by two orders of magnitude to reach a certain accuracy. After labeling100 pairs selected interactively by our system, the learnt deduplication function can achieve the peak accuracy which a randomly chosen set of pairs cannot achieve even with 7000 pairs.

Overall architecture

Figure 1 shows the overall design of our ALIAS system for deduplication. There are three primary inputs to the system:

  1. . Database of records (D) The original set D of records in which duplicates need to be detected. The data has d attributes a l, . . . a d, each of which could be textual or numeric. The goal of the system is to find the subset of pairs in the cross-product D x D that can be labeled as duplicates.
  2. . Initial training pairs (L) An optional small(less than ten) seed L of training records arranged in pairs of duplicates or non-duplicates.
  3. . Similarity functions (.T) A set .~ of n I functions each of which computes a similarity match between two records rl, r2 based on any subset of d attributes. Examples of such functions are edit-distance, soundex, abbreviationmatch on text fields, and absolute difference for integer fields. Many of the common functions could be inbuilt and added by default based on the data type. However, it is impossible to totally obviate an expert's domain knowledge in designing specific matching functions. These functions can be coded in the native language of the system (C++ in our case) and loaded dynamically. The functions in the set can be highly redundant and unrelated to each other because finally our automated learner wil perform the nontrivial task of finding the right

References

  • 1 S. Argamon-Engelson and I. Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335--360, 1999.
  • 2 Vinayak Borkar, Kaustubh Deshmukh, Sunita Sarawagi, Automatic segmentation of text into structured records, Proceedings of the 2001 ACM SIGMOD Conference, p.175-186, May 21-24, 2001, Santa Barbara, California, United States
  • 3 Chris Buckley, Gerard M. Salton, James Allan, The effect of adding relevance information in a relevance feedback environment, Proceedings of the 17th ACM SIGIR Conference retrieval, p.292-300, July 03-06, 1994, Dublin, Ireland
  • 4 Christopher J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, v.2 n.2, p.121-167, June 1998 doi:10.1023/A:1009715923555
  • 5 Efficient Evaluation of Queries with Mining Predicates, Proceedings of the 18th International Conference on Data Engineering, p.529, February 26-March 01, 2002
  • 6 David Cohn, Les Atlas, Richard Ladner, Improving Generalization with Active Learning, Machine Learning, v.15 n.2, p.201-221, May 1994 doi:10.1023/A:1022673506211
  • 7 Ronan Collobert, Samy Bengio, SVMTorch: support vector machines for large-scale regression problems, The Journal of Machine Learning Research, 1, p.143-160, 9/1/2001
  • 8 Yoav Freund, H. Sebastian Seung, Eli Shamir, Naftali Tishby, Selective Sampling Using the Query by Committee Algorithm, Machine Learning, v.28 n.2-3, p.133-168, Aug./Sept. 1997
  • 9 Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proceedings of the 27th International Conference on Very Large Data Bases, p.371-380, September 11-14, 2001
  • 10 Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free, Proceedings of the 27th International Conference on Very Large Data Bases, p.491-500, September 11-14, 2001
  • 11 Mauricio A. Hernández, Salvatore J. Stolfo, Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem, Data Mining and Knowledge Discovery, v.2 n.1, p.9-37, January 1998 doi:10.1023/A:1009761603038
  • 12 J. Hylton. Identifying and merging related bibliographic records. Master's thesis, MIT, 1996.
  • 13 Vijay S. Iyengar, Chidanand Apte, Tong Zhang, Active learning using adaptive resampling, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.91-98, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347110
  • 14 W. C. Jacob. Learning to match and cluster entity names. In ACM SIGIR' 01 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.
  • 15 Ron Kohavi, Dan Sommerfield, James Dougherty. (1996) Data Mining using MLC++, A Machine Learning Library in C++.” In: Proceedings of the 8th International Conference on Tools with Artificial Intelligence (ICTAI 1996).
  • 16 Steve Lawrence, C. Lee Giles, Kurt Bollacker, Digital Libraries and Autonomous Citation Indexing, Computer, v.32 n.6, p.67-71, June 1999 doi:10.1109/2.769447
  • 17 R. Liere and P. Tadepalli. Active learning with committees for text categorization. In: Proceedings of AAAI-97, 14th Conference of the American Association for Artificial Intelligence, pages 591--596, Providence, US, (1997). AAAI Press, Menlo Park, US.
  • 18 Andrew McCallum, K. Nigam, J. Reed, J. Rennie, and K. Seymore. Cora: Computer science research paper search engine, http://cora.whizbang.com/, 2000.
  • 19 (McCallum et al., 2000) ⇒ Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. (2000). “Efficient Clustering of High-dimensional Data Sets with Application to Reference Matching.” In: Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2000).
  • 20 Andrew McCallum, Kamal Nigam, Employing EM and Pool-based Active Learning for Text Classification, Proceedings of the Fifteenth International Conference on Machine Learning, p.350-358, July 24-27, 1998
  • 21 Tom M. Mitchell, Machine Learning, McGraw-Hill Higher Education, 1997
  • 22 Alvaro E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
  • 23 Gonzalo Navarro, A guided tour to approximate string matching, ACM Computing Surveys (CSUR), v.33 n.1, p.31-88, March 2001 doi:10.1145/375360.375365
  • 24 J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1993
  • 25 Vijayshankar Raman, Joseph M. Hellerstein, Potter's Wheel: An Interactive Data Cleaning System, Proceedings of the 27th International Conference on Very Large Data Bases, p.381-390, September 11-14, 2001
  • 26 Sunita Sarawagi, editor. IEEE Data Engineering special issue on Data Cleaning. http://www.research.microsoft, com/research/db/debull/A00dec/issue.htm, December 2000.
  • 27 Greg Schohn, David Cohn, Less is More: Active Learning with Support Vector Machines, Proceedings of the Seventeenth International Conference on Machine Learning, p.839-846, June 29-July 02, 2000
  • 28 H. S. Seung, M. Opper, H. Sompolinsky, Query by committee, Proceedings of the fifth annual workshop on Computational learning theory, p.287-294, July 27-29, 1992, Pittsburgh, Pennsylvania, United States doi:10.1145/130385.130417
  • 29 S. Toney. Cleanup and deduplication of an international deduplication function. Information Technology and libraries, 11(1):19--28, 1992.
  • 30 Simon Tong, Daphne Koller, Support vector machine active learning with applications to text classification, The Journal of Machine Learning Research, 2, p.45-66, 3/1/2002
  • 31 W. E. Winkler. Matching and record linkage. In B. G. C. et al., editor, Business Survey Methods, pages 355--384. New York: J. Wiley, (1995). available from http://www.census.gov/.
  • 32 W. E. Winkler. The state of record linkage and current research problems. RR99/04, http://www.census.gov/srd/papers/pdf/rr99-04.pdf, 1999.
  • 33 Bianca Zadrozny, Charles Elkan, Learning and making decisions when costs and probabilities are both unknown, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.204-213, August 26-29, 2001, San Francisco, California doi:10.1145/502512.502540
  • 34 Sarah Zelikovitz, Haym Hirsh, Improving Short-Text Classification using Unlabeled Data for Classification Problems, Proceedings of the Seventeenth International Conference on Machine Learning, p.1191-1198, June 29-July 02, 2000,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2002 InteractiveDeduplicationAnuradha Bhamidipaty
Sunita Sarawagi
Interactive Deduplication Using Active LearningProceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD 2002http://www.it.iitb.ac.in/~sunita/papers/kdd02.pdf10.1145/775047.7750872002