2007 DuplicateRecordDetectionASurvey

From GM-RKB

Subject Headings: Record Coreference Resolution Task, Record Coreference Resolution Algorithm, Survey Paper.

Notes

Cited By

2011

Quotes

Abstract

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with a coverage of existing tools and with a brief discussion of the big open problems in the area.

1. Introduction

Databases play an important role in today's IT based economy. Many industries and systems depend on the accuracy of databases to carry out operations. Therefore, the quality of the information (or the lack thereof) stored in the databases, can have significant cost implications to a system that relies on information to function and conduct business. In an error-free system with perfectly clean data, the construction of a comprehensive view of the data consists of linking (in relational terms, joining) two or more tables on their key fields. Unfortunately, data often lack a unique, global identifier that would permit such an operation. Furthermore, the data are neither carefully controlled for quality nor defined in a consistent way across different data sources. Thus, data quality is often compromised by many factors, including data entry errors (e.g., Microsft instead of Microsoft), missing integrity constraints (e.g., allowing entries such as EmployeeAge=567), and multiple conventions for recording information (e.g., 44 W. 4th St. vs. 44 West Fourth Street). To make things worse, in independently managed databases not only the values, but the structure, semantics and underlying assumptions about the data may differ as well.

Often, while integrating data from different sources to implement a data warehouse, organizations become aware of potential systematic differences or conflicts. Such problems fall under the umbrella-term data heterogeneity [14]. Data cleaning [77], or data scrubbing [96], refer to the process of resolving such identification problems in the data. We distinguish between two types of data heterogeneity: structural and lexical. Structural heterogeneity occurs when the fields of the tuples in the database are structured differently in different databases. For example, in one database, the customer address might be recorded in one field named, say, addr, while in another database the same information might be stored in multiple fields such as street, city, state, and zipcode. Lexical heterogeneity occurs when the tuples have identically structured fields across databases, but the data use different representations to refer to the same real-world object (e.g., StreetAddress=44 W. 4th St. vs. StreetAddress=44 West Fourth Street).
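Lexical mismatches like the ones above are typically caught with the field-level string similarity metrics the survey covers in Section III, the classic one being the Levenshtein edit distance (number of insertions, deletions, and substitutions needed to turn one string into another). The following is a minimal illustrative sketch, not code from the survey:

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance over insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (0 cost if equal)
        prev = curr
    return prev[-1]

# the data-entry error from the introduction is one edit away from the truth
print(levenshtein("Microsft", "Microsoft"))   # 1
```

A small distance relative to the string length is the usual signal that two field values refer to the same real-world entity; thresholds and weighting schemes vary across the techniques surveyed.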

In this paper, we focus on the problem of lexical heterogeneity and survey various techniques which have been developed for addressing this problem. We focus on the case where the input is a set of structured and properly segmented records, i.e., we focus mainly on cases of database records. Hence, we do not cover solutions for various other problems, such as that of mirror detection, in which the goal is to detect similar or identical web pages (e.g., see [13], [18]). Also, we do not cover solutions for problems such as anaphora resolution [56], in which the problem is to locate different mentions of the same entity in free text (e.g., that the phrase "President of the U.S." refers to the same entity as "George W. Bush"). We should note that the algorithms developed for mirror detection or for anaphora resolution are often applicable to the task of duplicate detection. Techniques for mirror detection have been used for detection of duplicate database records (see, for example, Section V-A.4) and techniques for anaphora resolution are commonly used as an integral part of deduplication in relations that are extracted from free text using information extraction systems [52].

The problem that we study has been known for more than five decades as the record linkage or the record matching problem [31], [61]–[64], [88] in the statistics community. The goal of record matching is to identify records in the same or different databases that refer to the same real-world entity, even if the records are not identical. In a slightly ironic twist, the same problem goes by multiple names across research communities. In the database community, the problem is described as merge-purge [39], data deduplication [78], and instance identification [94]; in the AI community, the same problem is described as database hardening [21] and name matching [9]. The names coreference resolution, identity uncertainty, and duplicate detection are also commonly used to refer to the same task. We will use the term duplicate record detection in this paper.

The remaining part of this paper is organized as follows: In Section II, we briefly discuss the necessary steps in the data cleaning process, before the duplicate record detection phase. Then, Section III describes techniques used to match individual fields, and Section IV presents techniques for matching records that contain multiple fields. Section V describes methods for improving the efficiency of the duplicate record detection process and Section VI presents a few commercial, off-the-shelf tools used in industry for duplicate record detection and for evaluating the initial quality of the data and of the matched records. Finally, Section VII concludes the paper and discusses interesting directions for future research.

2. Data Preparation

...

VII. Future Directions and Conclusion

In this paper, we have presented a comprehensive survey of the existing techniques used for detecting non-identical duplicate entries in database records. The interested reader may also want to read a complementary survey by Winkler [100] and the Special Issue of the IEEE Data Engineering Bulletin on Data Quality [45].

As database systems are becoming more and more commonplace, data cleaning is going to be the cornerstone for correcting errors in systems which are accumulating vast amounts of errors on a daily basis. Despite the breadth and depth of the presented techniques, we believe that there is still room for substantial improvements in the current state-of-the-art.

First of all, it is currently unclear which metrics and techniques are the current state-of-the-art. The lack of standardized, large scale benchmarking data sets is a big obstacle for the further development of the field, as it is almost impossible to convincingly compare new techniques with existing ones. A repository of benchmark data sources with known and diverse characteristics should be made available to developers so they may evaluate their methods during the development process. Along with benchmark and evaluation data, various systems need some form of training data to produce the initial matching model. Although small data sets are available, we are not aware of large-scale, validated data sets that could be used as benchmarks. Winkler [98] describes techniques for deriving data sets that are properly anonymized and are still useful for duplicate record detection purposes.

Currently, there are two main approaches for duplicate record detection. Research in databases emphasizes relatively simple and fast duplicate detection techniques that can be applied to databases with millions of records. Such techniques typically do not rely on the existence of training data, and emphasize efficiency over effectiveness. On the other hand, research in machine learning and statistics aims to develop more sophisticated matching techniques that rely on probabilistic models. An interesting direction for future research is to develop techniques that combine the best of both worlds.

Most of the duplicate detection systems available today offer various algorithmic approaches for speeding up the duplicate detection process. The changing nature of the duplicate detection process also requires adaptive methods that detect different patterns for duplicate detection and automatically adapt themselves over time. For example, a background process could monitor the current data, incoming data and any data sources that need to be merged or matched, and decide, based on the observed errors, whether a revision of the duplicate detection process is necessary. Another related aspect of this challenge is to develop methods that permit the user to estimate the proportions of errors expected in data cleaning projects. Finally, large amounts of structured information are now derived from unstructured text and from the web. This information is typically imprecise and noisy; duplicate record detection techniques are crucial for improving the quality of the extracted data. The increasing popularity of information extraction techniques is going to make this issue more prevalent in the future, highlighting the need to develop robust and scalable solutions. This only adds to the sentiment that more research is needed in the area of duplicate record detection and in the area of data cleaning and information quality in general.
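The speed-up techniques mentioned above (surveyed in Section V) generally rest on blocking: rather than comparing all O(n²) record pairs, records are first grouped by a cheap key, and expensive similarity comparisons run only within each group. A minimal sketch, with a hypothetical blocking key (surname initial plus city) chosen purely for illustration:

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key_fn):
    """Group records by a cheap blocking key, then emit candidate pairs
    only from within each block, avoiding the quadratic all-pairs scan."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return [pair for group in blocks.values() for pair in combinations(group, 2)]

people = [
    {"name": "John Smith", "city": "NY"},
    {"name": "Jon Smith",  "city": "NY"},
    {"name": "Mary Jones", "city": "LA"},
]
# hypothetical key: first letter of the surname plus the city field
pairs = block_by_key(people, lambda r: (r["name"].split()[-1][0], r["city"]))
print(len(pairs))   # 1 -- only the two Smith records become a candidate pair
```

Only the candidate pairs then go through a detailed field-by-field comparison. The trade-off is recall: a true duplicate whose key fields are themselves corrupted can land in different blocks, which is why multi-pass schemes with several independent keys are common in the surveyed literature.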

References

  • Eugene Agichtein and Venkatesh Ganti. Mining reference tables for automatic text segmentation. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), pages 20–29, (2004).
  • Rakesh Agrawal and Ramakrishnan Srikant. Searching with numbers. In: Proceedings of the 11th International World Wide Web Conference (WWW11), pages 420–431, (2002).
  • Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1st edition, February (1993).
  • Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Meyers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, October 1990.
  • Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti. Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th International Conference on Very Large Databases (VLDB 2002), (2002).
  • Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, (2004).
  • Rohan Baxter, Peter Christen, and Tim Churches. A comparison of fast blocking methods for record linkage. In ACM SIGKDD '03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 25–27, (2003).
  • Indrajit Bhattacharya and Lise Getoor. A latent Dirichlet allocation model for entity resolution. Technical Report CS-TR-4740, Computer Science Department, University of Maryland, August (2005).
  • Mikhail Bilenko, Raymond Mooney, William Weston Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, September/October (2003).
  • Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In COLT '98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, (1998).
  • Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic segmentation of text into structured records. In: Proceedings of the 2001 ACM SIGMOD Conference (SIGMOD 2001), pages 175–186, (2001).
  • Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. CRC Press, July (1984).
  • Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In: Proceedings of the Sixth International World Wide Web Conference (WWW6), pages 1157–1166, (1997).
  • Abhirup Chatterjee and Arie Segev. Data manipulation in heterogeneous databases. ACM SIGMOD Record, 20(4):64–68, December (1991).
  • Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. In: Proceedings of the 2003 ACM SIGMOD Conference (SIGMOD 2003), pages 313–324, (2003).
  • Surajit Chaudhuri, Venkatesh Ganti, and Rajeev Motwani. Robust identification of fuzzy duplicates. In: Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005), pages 865–876, (2005).
  • Peter Cheeseman and John Stutz. Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining, pages 153–180. AAAI Press/The MIT Press, (1996).
  • Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina. Finding replicated web collections. In: Proceedings of the 2000 ACM SIGMOD Conference (SIGMOD 2000), pages 355–366, (2000).
  • Munir Cochinwala, Verghese Kurien, Gail Lalk, and Dennis Shasha. Efficient data reconciliation. Information Sciences, 137(1-4):1–15, September (2001).
  • William W. Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3):288–321, (2000).
  • William W. Cohen, Henry Kautz, and David McAllester. Hardening soft information sources. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), pages 255–259, (2000).
  • William Weston Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In: Proceedings of the 1998 ACM SIGMOD Conference (SIGMOD 1998), pages 201–212, (1998).
  • William Weston Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), (2002).
  • David A. Cohn, Les Atlas, and Richard E. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, (1994).
  • Tamraparni Dasu, Theodore Johnson, Shanmugauelayut Muthukrishnan, and Vladislav Shkapenyuk. Mining database structure; or, how to build a data quality browser. In: Proceedings of the 2002 ACM SIGMOD Conference (SIGMOD 2002), pages 240–251, (2002).
  • Arthur Pentland Dempster, Nan McKenzie Laird, and Donald Bruce Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B(39):1–38, 1977.
  • Debabrata Dey, Sumit Sarkar, and Prabuddha De. Entity matching in heterogeneous databases: A distance based decision model. In 31st Annual Hawaii International Conference on System Sciences (HICSS'98), pages 305–313, (1998).
  • Nelson S. D'Andrea Du Bois, Jr. A solution to the problem of linking multivariate documents. Journal of the American Statistical Association, 64(325):163–174, March 1969.
  • Richard Oswald Duda and Peter Elliot Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
  • Mohamed G. Elfeky, Ahmed K. Elmagarmid, and Vassilios S. Verykios. TAILOR: A record linkage tool box. In: Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE 2002), pages 17–28, (2002).
  • Ivan Peter Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, December 1969.
  • Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, and Cristian-Augustin Saita. Declarative data cleaning: Language, model, and algorithms. In: Proceedings of the 27th International Conference on Very Large Databases (VLDB 2001), pages 371–380, (2001).
  • Leicester E. Gill. OX-LINK: The Oxford medical record linkage system. In: Proceedings of the International Record Linkage Workshop and Exposition, pages 15–33, (1997).
  • Luis Gravano, Panagiotis G. Ipeirotis, Hosagrahar Visvesvaraya Jagadish, Nick Koudas, Shanmugauelayut Muthukrishnan, Lauri Pietarinen, and Divesh Srivastava. Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28–34, December (2001).
  • Luis Gravano, Panagiotis G. Ipeirotis, Hosagrahar Visvesvaraya Jagadish, Nick Koudas, Shanmugauelayut Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In: Proceedings of the 27th International Conference on Very Large Databases (VLDB 2001), pages 491–500, (2001).
  • Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, and Divesh Srivastava. Text joins in an RDBMS for web data integration. In: Proceedings of the 12th International World Wide Web Conference (WWW12), pages 90–101, (2003).
  • Sudipto Guha, Nick Koudas, Amit Marathe, and Divesh Srivastava. Merging the results of approximate match operations. In: Proceedings of the 30th International Conference on Very Large Databases (VLDB 2004), pages 636–647, (2004).
  • Trevor Hastie, Robert Tibshirani, and Jerome Harold Friedman. The Elements of Statistical Learning. Springer Verlag, August (2001).
  • Mauricio Antonio Hernández and Salvatore Joseph Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9–37, January (1998).
  • Matthew A. Jaro. Unimatch: A record linkage system: User's manual. Technical report, U.S. Bureau of the Census, Washington, D.C., 1976.
  • Matthew A. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, June (1989).
  • Thorsten Joachims. Making large-scale SVM learning practical. In Bernhard Schölkopf, Christopher John C. Burges, and Alexander J. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT-Press, (1999).
  • Ralph Kimball and Joe Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. John Wiley & Sons, (2004).
  • Daphne Koller and Mehran Sahami. Hierarchically classifying documents using very few words. In: Proceedings of the 14th International Conference on Machine Learning (ICML'97), pages 170–178, (1997).
  • Nick Koudas, editor. IEEE Data Engineering Bulletin, volume 29. IEEE, June (2006). Special Issue on Data Quality.
  • Nick Koudas, Amit Marathe, and Divesh Srivastava. Flexible string matching against large databases in practice. In: Proceedings of the 30th International Conference on Very Large Databases (VLDB 2004), pages 1078–1086, (2004).
  • Karen Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, December (1992).
  • Gad M. Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2):157–169, June (1989).
  • Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848, 1965. Original in Russian; translation in Soviet Physics Doklady, 10(8):707–710, 1966.
  • Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, and James Richardson. Entity identification in database integration. In: Proceedings of the Ninth IEEE International Conference on Data Engineering (ICDE 1993), pages 294–301, (1993).
  • Nikos Mamoulis. Efficient processing of joins on set-valued attributes. In: Proceedings of the 2003 ACM SIGMOD Conference (SIGMOD 2003), pages 157–168, (2003).
  • Andrew McCallum. Information extraction: Distilling structured data from unstructured text. ACM Queue, 3(9):48–57, (2005).
  • Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In: Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 591–598, (2000).
  • Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), pages 169–178, (2000).
  • Andrew McCallum and Ben Wellner. Conditional models of identity uncertainty with application to noun coreference. In Advances in Neural Information Processing Systems (NIPS 2004), (2004).
  • Ruslan Mitkov. Anaphora Resolution. Longman, 1st edition, August (2002).
  • Alvaro E. Monge and Charles P. Elkan. The field matching problem: Algorithms and applications. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 267–270, (1996).
  • Alvaro E. Monge and Charles P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the 2nd ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'97), pages 23–29, (1997).
  • Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, (2001).
  • Saul Ben Needleman and Christian Dennis Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, March 1970.
  • Howard B. Newcombe. Record linking: The design of efficient systems for linking records into individual and family histories. American Journal of Human Genetics, 19(3):335–359, May 1967.
  • Howard B. Newcombe. Handbook of Record Linkage. Oxford University Press, (1988).
  • Howard B. Newcombe and James M. Kennedy. Record linkage: Making maximum use of the discriminating power of identifying information. Communications of the ACM, 5(11):563–566, November 1962.
  • Howard B. Newcombe, James M. Kennedy, S.J. Axford, and A.P. James. Automatic linkage of vital records. Science, 130(3381):954–959, October 1959.
  • Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, (2000).
  • Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart J. Russell, and Ilya Shpitser. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems (NIPS 2002), pages 1401–1408, (2002).
  • Mike Perkowitz, Robert B. Doorenbos, Oren Etzioni, and Daniel Sabey Weld. Learning to understand information on the Internet: An example-based approach. Journal of Intelligent Information Systems, 8(2):133–153, March (1997).
  • Lawrence Philips. Hanging on the metaphone. Computer Language Magazine, 7(12):39–44, December (1990). Accessible at http://www.cuj.com/documents/s=8038/cuj0006philips/.
  • Lawrence Philips. The double metaphone search algorithm. C/C++ Users Journal, 18(5), June (2000).
  • Jose C. Pinheiro and Don X. Sun. Methods for linking and mining heterogeneous databases. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 309–313, (1998).
  • Vijayshankar Raman and Joseph M. Hellerstein. Potter's wheel: An interactive data cleaning system. In: Proceedings of the 27th International Conference on Very Large Databases (VLDB 2001), pages 381–390, (2001).
  • Pradeep Ravikumar and William Weston Cohen. A hierarchical graphical model for record linkage. In 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), (2004).
  • Eric Sven Ristad and Peter N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522–532, May (1998).
  • Elke Rundensteiner, editor. IEEE Data Engineering Bulletin, volume 22. IEEE, January (1999). Special Issue on Data Transformation.
  • Robert C. Russell. Index, U.S. patent 1,261,167. Available at http://patft.uspto.gov/netahtml/srchnum.htm, April 1918.
  • Robert C. Russell. Index, U.S. patent 1,435,663. Available at http://patft.uspto.gov/netahtml/srchnum.htm, November 1922.
  • Sunita Sarawagi, editor. IEEE Data Engineering Bulletin, volume 23. IEEE, December (2000). Special Issue on Data Cleaning.
  • Sunita Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), pages 269–278, (2002).
  • Sunita Sarawagi and Alok Kirpal. Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD Conference (SIGMOD 2004), pages 743–754, (2004).
  • Parag Singla and Pedro Domingos. Multi-relational record linkage. In KDD-2004 Workshop on Multi-Relational Data Mining, pages 31–48, (2004).
  • Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, (1981).
  • Aya Soffer, David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, and Yoelle S. Maarek. Static index pruning for information retrieval systems. In: Proceedings of the 24th ACM SIGIR Conference (SIGIR 2001), pages 43–50, (2001).
  • Erkki Sutinen and Jorma Tarhio. On using q-gram locations in approximate string matching. In: Proceedings of the Third Annual European Symposium on Algorithms (ESA'95), pages 327–340, (1995).
  • Charles Sutton, Khashayar Rohanimanesh, and Andrew McCallum. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In: Proceedings of the 21st International Conference on Machine Learning (ICML 2004), (2004).
  • Robert L. Taft. Name search techniques. Technical Report Special Report No. 1, New York State Identification and Intelligence System, Albany, NY, February 1970.
  • Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning object identification rules for information integration. Information Systems, 26(8):607–633, (2001).
  • Sheila Tejada, Craig Alan Knoblock, and Steven Minton. Learning domain-independent string transformation weights for high accuracy object identification. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), (2002).
  • Benjamin J. Tepping. A model for optimum linkage of records. Journal of the American Statistical Association, 63(324):1321–1332, December 1968.
  • Esko Ukkonen. Approximate string matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191–211, (1992).
  • Julian R. Ullmann. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. The Computer Journal, 20(2):141–147, 1977.
  • Vassilios S. Verykios, Ahmed K. Elmagarmid, and Elias N. Houstis. Automating the approximate record matching process. Information Sciences, 126(1-4):83–98, July (2000).
  • Vassilios S. Verykios and George V. Moustakides. A generalized cost optimal decision model for record matching. In: Proceedings of the 2004 International Workshop on Information Quality in Information Systems, pages 20–26, (2004).
  • Vassilios S. Verykios, George V. Moustakides, and Mohamed G. Elfeky. A Bayesian decision model for cost optimal record matching. VLDB Journal, 12(1):28–40, May (2003).
  • Y. Richard Wang and Stuart E. Madnick. The inter-database instance identification problem in integrating autonomous systems. In: Proceedings of the Fifth IEEE International Conference on Data Engineering (ICDE 1989), pages 46–55, (1989).
  • Michael S. Waterman, Temple F. Smith, and William A. Beyer. Some biological sequence metrics. Advances in Mathematics, 20(4):367–387, 1976.
  • Jennifer Widom. Research problems in data warehousing. In: Proceedings of the 1995 ACM Conference on Information and Knowledge Management (CIKM'95), pages 25–30, (1995).
  • William E. Winkler. Improved decision rules in the Fellegi-Sunter model of record linkage. Technical Report Statistical Research Report Series RR93/12, U.S. Bureau of the Census, Washington, D.C., (1993).
  • William E. Winkler. The state of record linkage and current research problems. Technical Report Statistical Research Report Series RR99/04, U.S. Bureau of the Census, Washington, D.C., (1999).
  • William E. Winkler. Methods for record linkage and Bayesian networks. Technical Report Statistical Research Report Series RRS2002/05, U.S. Bureau of the Census, Washington, D.C., 2002.
  • William E. Winkler. Overview of record linkage and current research directions. Technical Report Statistical Research Report Series RRS2006/02, U.S. Bureau of the Census, Washington, D.C., (2006).
  • William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. decennial census. Technical Report Statistical Research Report Series RR91/09, U.S. Bureau of the Census, Washington, D.C., (1991).
  • William E. Yancey. Bigmatch: A program for extracting probable matches from a large file for record linkage. Technical Report Statistical Research Report Series RRC2002/01, U.S. Bureau of the Census, Washington, D.C., March (2002).
  • William E. Yancey. Evaluating string comparator performance for record linkage. Technical Report Statistical Research Report Series RRS2005/05, U.S. Bureau of the Census, Washington, D.C., June (2005).
  • Justin Zobel, Alistair Moffat, and Kotagiri Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4):453–490, December (1998).

Authors: Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios
Title: Duplicate Record Detection: A Survey
Year: 2007
URL: http://dc-pubs.dbs.uni-leipzig.de/files/Elmagarmid2007DuplicateRecordDetectionASurvey.pdf
DOI: 10.1109/TKDE.2007.9