2006 OverviewOfRecordLinkageAndCurrentResDirs

(Winkler, 2006) ⇒ William E. Winkler. (2006). “Overview of record linkage and current research directions.” Technical Report Statistical Research Report Series RRS2006/02, U.S. Bureau of the Census.

Subject Headings: Entity Record Deduplication Task, Entity Record Deduplication Algorithm, Survey.

Notes

Uses the name Record Linkage.

Quotes

Abstract

This paper provides background on record linkage methods that can be used in combining data from a variety of sources such as person lists business lists. It also gives some areas of current research.

1. Introduction

Record linkage is the means of combining information from a variety of computerized files. It is also referred to as data cleaning (McCallum and Wellner 2003) or object identification (Tejada et al. 2002). The basic methods compare name and address information across pairs of files to determine those pairs of records that are associated with the same entity. An entity might be a business, a person, or some other type of unit that is listed. Based on economic relationships, straightforward extensions of methods might create functions and associated metrics for comparing information such as receipts or taxable income. The most sophisticated methods use information from multiple lists (Winkler 1999b), create new functional relationships between variables in two files that can be associated with new metrics for identifying corresponding entities (Scheuren and Winkler 1997), or use graph theoretic ideas for representing linkage relationships as conditional random fields that be partitioned into clusters representing individual entities (McCallum and Wellner 2003, Wei 2004).
The most basic application is identifying duplicates within a file or identifying duplicates across two files. If a single large file is considered, then the record linkage or matching procedures may be intended to identify duplicates. The duplicates can have the effect of erroneously inflating estimates of the number of entities in different categories. For instance, in a list of business, duplicates would inflate estimates in different industrial categories. The duplicates could also cause the number of individuals employed in a set of different firms to be overestimated. If a larger file is being updated using information from a more current but smaller file, then the smaller file is used to obtain records of new entities. The smaller file may contain information about firms or businesses in various categories such as finance or services that may be underrepresented in the larger file. In some situations, a combination of a large amount of duplication and undercoverage may cause severe errors in any uses of the list and the quantitative data that is associated with the list.
If a number of files are combined into a data warehouse, then Fayad and Uthurusamy (1996, 2002) and Fayad et al. (1996) have stated that the majority (possibly above 90%) of the work is associated with cleaning up the duplicates. Winkler (1995) has shown that computerized record linkage procedures can significantly reduce the resources needed for identifying duplicates in comparison with methods that are primarily manual. Newcombe and Smith (1975) have demonstrated the purely computerized duplicate detection in high quality person lists can often identify duplicates at greater level of accuracy than duplicate detection that involves a combination of computerized procedures and review by highly trained clerks. The reason is that the computerized procedures can make use of overall information from large parts of a list. For instance, the purely computerized procedure can make use of the relative rarity of various names and combinations of information in identifying duplicates. The relative rarity is computed as the files are being matched. Winkler (1995, 1999a) observed that the automated frequency-based (or value-specific) procedures could account for the relative rarity of a name such as ‘Martinez’ in cities such as Minneapolis, Minnesota in the US in comparison with the relatively high frequency of “Martinez’ in Los Angeles, California.
In a very large 1990 Decennial Census application, the computerized procedures were able to reduce the need for clerks and field follow-up from an estimated 3000 individuals over 3 months to 200 individuals over 6 weeks (Winkler 1995). The reason for the need for 200 clerks is that both first name and age were missing from a small proportion of Census forms and Post Enumeration Survey forms. If the clerks cannot supply the missing data from auxiliary information, then a field visit is often needed to get the missing information. The Post Enumeration Survey (PES) provided an independent re-enumeration of a large number of blocks (small Census regions) that corresponded to approximately 70 individuals. The PES was matched to the Census so that a capture-recapture methodology could be used to estimate both undercoverage and overcoverage to improve Census estimates. In the 1992 U.S. Census of Agriculture, the computerized procedures were able to reduce the clerical review from 75 clerks over 3 months to 6500 hours of review (Winkler 1995). The entities in the Census of Agriculture lists are farms corresponding to individuals, various types of partnerships of individuals, and corporations. Based on a large validation follow-up of the matching, the computerized procedures identified more duplicates automatically than the clerks were able to identify in the previous Census of Agriculture. The duplication in the final file was 2% in contrast to 10% in the previous final file from five years earlier.
In some situations, computerized record linkage can help preserve the confidentiality of information in a particular file or in a group of files. The record linkage procedures can delineate records that are at risk of disclosure. To deal with the disclosure risk, Sweeney (1999) has methods for reducing the identifying information in a group of records such as might be available in a hospital or large health network. Individuals with the highest access levels might have nearly full access to information in files. Individuals with lower access levels might automatically have only their access reduced to certain aggregate quantities associated with individual entities. In addition to removing identifiers such as name, social security number, doctor’s name, health plan name, all address information, quantities and values associated with certain types of treatments might also be aggregated. Iyengar (2002) provides disclosure-risk-reduction procedures that reduce identifying information in an optimal manner while still preserving specific analytic properties of the files.
Abowd and Woodcock (2002, 2004) have built large files of business entities that are combined with persons for the purpose of studying employment and labor dynamics. To allow use of the files by other economists, they used a combination of record linkage and micro-data confidentiality procedures to identify at risk records and mask data associated with them. Other economists can use the semi-confidential files to develop models and software that are used on the original confidential data. The results of the analyses such as tables and regression coefficients on the confidential data are given an additional review before publication. Abowd and Vilhuber (2005) have observed that the effect of very small amounts of error in identifiers can have small effects of some estimates and relatively large effects on other estimates. For instance, in a file of a billion (109) records representing quarterly employment in the State of California for twenty years, the number of erroneous social security numbers (SSNs) is approximately 1-2% per quarter. If the SSNs are not corrected using record linkage methods, then there would be approximately two breaks in every time-series associated with an individual. The breaks in the time-series can drastically affect the estimates of job creation, job loss, and employment rates that are based on the files.
The outline of this paper is as follows. The second section gives more background on the types of record linkage that are currently being applied. The third section provides details of the record linkage model that was introduced by Newcombe (1959, 1962) and given a formal mathematical foundation by Fellegi and Sunter (1969). The basic ideas are based on statistical concepts such as odds ratios, hypothesis testing, and relative frequency. Because much of the data is based on textual information such as names and addresses, most of the advances in record linkage methods have been in the computer science literature. In particular, methods of string comparison for accounting for typographical error (Winkler 1990a, Cohen et al. 2003a,b), methods of parsing names and addresses into components that correspond and can be more readily compared (e.g., Winkler 1995, Borkar et al. 2001, Christen et al. 2002, Churches et al. 2002), and automated methods of obtained optimal record linkage without training data (Winkler 1989a, 1993b) and with training data (Bilenko et al. 2003). The fourth section covers a number of methods of research for improving matching efficacy. Many of the ideas are being applied in particular settings. The combination of methods is likely to yield significant improvements in matching business lists. Business lists are typically more difficult to match than other types of lists such as person lists or agriculture lists because of the variants of names and the variants of addresses (Winkler 1995). The fifth section is concluding remarks.

2. Background

Record linkage of files (Fellegi and Sunter 1969) is used to identify duplicates when unique identifiers are unavailable. It relies primarily on matching of names, addresses, and other fields that are typically not unique identifiers of entities. Matching businesses using business names and other information can be particularly difficult (Winkler 1995). Record linkage is also called object identification (Tejada et al. 2001, 2002), data cleaning (Do and Rahm 2000), approximate matching or approximate joins (Gravanao et al. 2001, Guha et al.2004), fuzzy matching (Ananthakrisha et al. 2002), and entity resolution (Benjelloun et al. 2005).
Rahm and Do (2000) provided an overview of data cleaning and some research problems. Tejada et al. (2001, 2002) showed how to define linkage rules in a database environment. Hernandez and Stolfo (1995) gave merge/purge methods for matching in a database situation.

Sarawagi and Bhamidipaty (2002) and Winkler (2002) demonstrated how machine-learning methods could be applied in record linkage situations where training data (possibly small amounts) were available. Ananthakrishna et al. (2002) and Jin et al. (2002, 2003) provided methods for linkage in very large files in the database environment. Cohen and Richman (2002) showed how to cluster and match entity names using methods that are scalable and adaptive. Cohen et al. (2003a,b) provide new methods of adaptive string comparators based on hidden Markov models that improve on the non-adaptive string comparators in a number of situations. Bilenko and Mooney (2003) provide adaptive, Hidden Markov methods that both standardize and parse names and addresses while comparing and matching components of them. Borkar et al. (2001), Christen et al. (2002), and Churches et al. (2002) use Hidden Markov models for adaptive name and address standardization. Wei (2004) provides new Markov edit algorithms that should improve over the algorithms that apply basic Hidden Markov models for string comparison. Lafferty et al. (2001), McCallum and Wellner (2003), and Culotta and Mcallum (2005) used conditional random fields for representing an exceptionally large number of linkage relationships in which the likelihoods are optimized via Markov Chain Monte Carlo (MCMC) and graph partitioning methods. Ishikawa (2003) provides on a more general characterization of Markov random fields, graph partitioning, and optimization of likelihoods that may yield faster computational algorithms. Chaudhuri et al. (2005) provide a theoretic characterization of matching in terms of generalized distances and certain aggregates. Benjelloun et al. (2005) provide a characterization of how pairs are brought together during a matching process.

In a substantial number of situations, the files are too big to consider every pair in the cross product space of all pairs from two files. Newcombe (1962, 1988) showed how to reduce the number of pairs considered by only considering pairs that agreed on a characteristic such as surname or date-of-birth. Such reduction in the number of pairs is called blocking. Hernandez and Stolfo (1995) also showed how to use multiple passes on a database to bring together pairs. Each pass corresponds to a different sort ordering of the files and pairs are only considered in a sliding window of fixed size. After the pairs are brought together more advanced, computeintensive methods are used for comparing them. McCallum et al. (2000) showed that the first step should be a clustering step that is performed with an easily computable, fast method (referred to as canopies) and the second step can use more expensive computational methods.

Chaudhuri et al (2003) showed how to create index structures that allow for certain types of typographical error in matching within a database. In their application, their methods reduced computation by a factor of three in comparison with naïve methods that compare every pair of records in two files. Baxter et al. (2003). used a more easily applied method based on q-grams (described later). Still more advanced methods rely of embedding the files and associated string comparators for approximate string comparison in versions of d-dimensional Euclidean space and using sophisticated R-tree bi-comparison searches (Jin et al. 2003). With the possibility of significant computation associated with the embedding algorithms, the methods are intended to reduce computation associated with comparing pairs from O(N2) to O(N) or O(NlogN) where N is the size of the file being searched. Yancey and Winkler (2004) developed BigMatch technology for matching moderate size lists of 100 million records against large administrative files having upwards of 4 billion records. The methods are faster than the other methods because they rely on models of typographical error corresponding to actual names and addresses that are not always explicitly used in some of the other new methods.

3.3 String comparators

In many matching situations, it is not possible to compare two strings exactly (character-by-character) because of typographical error. Dealing with typographical error via approximate string comparison has been a major research project in computer science (see e.g., Hall and Dowling 1980, Navarro 2001). In record linkage, one needs to have a function that represents approximate agreement, with agreement being represented by 1 and degrees of partial agreement being represented by numbers between 0 and 1. One also needs to adjust the likelihood ratios (3) according to the partial agreement values. Having such methods is crucial to matching. For instance, in a major census application for measuring undercount, more than 25% of matches would not have been found via exact character-by-character matching. Three geographic regions are considered in Table 5. The function Φ represents exact agreement when it takes value one and represents partial agreement when it takes valuesone. In the St Louis region, for instance, 25% of first names and 15% of last names did not agree character-by-character among pairs that are matches.
Jaro (1989) introduced a string comparator that accounts for insertions, deletions, and transpositions. The basic Jaro algorithm has three components: (1) compute the string lengths, (2) find the number of common characters in the two strings, and (3) find the number of transpositions. The definition of common is that the agreeing character must be within half the length of the shorter string. The definition of transposition is that the character from one string is out of order with the corresponding common character from the other string.

4. Current research

This section considers thirteen areas of current research.

The first area consists of methods for automatically or semi-automatically estimating error rates. The second area consists of methods for using information from additional files to improve the matching in two files A and B. The third area covers mechanisms for bringing together pairs in very large files in the presence of moderate or significant typographical error in individual fields. The fourth area is methods for constructing functions y = f(x) where x in A and y in B and associated comparison metrics so that additional information can be used in improving matching. The fifth area provides methods of creating large linkage graphs of information from multiple files in which the linkage information and the information in fields are compared. In the sixth area, we consider methods of analysis of merged data that partially compensate for linkage error. The seventh area considers the effects of relative frequency of weakly identifying strings, methods for estimating typographical error rates, and situations where lists have low overlap rates. In the eighth area, we consider where additional fields, lack of independence, and typographical error rates can affect matching. The ninth area covers existing classification algorithms that are used for record linkage. The tenth area provides an overview of methods of string comparison and closely related extensions for longer fields. In the eleventh area, we consider methods of standardization and data extraction. The twelfth area provides methods for maintaining population registers and other large files. In the thirteenth area, we consider how multiple addresses (and other fields tracked over time) can significantly improve matching.

4.2 Auxiliary information for matching files A and b

A bridging file is a file that can be used in improving the linkages between two other files. Typically, a bridging file might be an administrative file that is maintained by a governmental unit.

5. Concluding remarks

This document provides an overview of record linkage. Record linkage is also referred to as data cleaning or object identification. It gives background on how record linkage has been applied in matching lists of businesses. It points out directions of research for improving the linkage methods.

References

J. M. Abowd, and L. Vilhuber. (2005). “The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers (with discussion).” In: Journal of Business and Economic Statistics, 23 (2), 133-165.
J. M. Abowd, and S. D. Woodcock. (2004). “Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data.” In: J. Domingo-Ferrer, and V. Torra, eds., Privacy in Statistical Databases 2004. Springer.
J. M. Abowd, and S. D. Woodcock. (2002). “Disclosure Limitation in Longitudinal Linked Data.” In: P. Doyle et al., eds. Confidentiality, Disclosure, and Data Access. North Holland: Amsterdam.
Agichstein, E., and Ganti, V. (2004), “Mining Reference Tables for Automatic Text Segmentation,” ACM Knowledge Discovery and Data Mining Conference 2004, 20-29.
Alvey, W. and Jamerson, B. (eds.) (1997), Record Linkage Techniques -- 1997 (Proceedings of An International Record Linkage Workshop and Exposition, March 20-21, 1997, in Arlington VA), also published by National Academy Press (1999) and available at http://wwww.fcsm.gov under methodology reports.
Ananthakrishna, R., Chaudhuri, S., and Ganti, V. (2002), “Eliminating Fuzzy Duplicates in Data Warehouses,” Very Large Data Bases 2002, 586-597.
Baxter, R., Christen, P., and Churches, T. (2003), “A Comparison of Fast Blocking Methods for Record Linkage,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington, DC, August 2003.
Belin, T. R., and Rubin, D. B. (1995), "A Method for Calibrating False- Match Rates in Record Linkage," Journal of the American Statistical Association, 90, 694-707.
Benjelloun, O., Garcia-Molina, H., Su, Q., and Widom, J. (2005), “Swoosh: A Generic Approach to Entity Resolution,” Stanford University technical report, March 2005.

Bentley, J.L., and Sedgewick, R.A. (1960), “Fast Algorithms for Searching and Sorting Strings, Proceedings of the Eighth ACM-SIAM Symposium on Discrete Algorithms,” 360-369. Bertolazzi, P., De Santis, L., and Scannapieco, M. (2003), “Automatic Record Matching in Cooperative Information Systems,” Workshop on Data Quality in Cooperative Information Systems, Siena, Italy, January, 2003. Bilenko, M. and Mooney, R. J. (2003a), “Adaptive Duplicate Detection Using Learnable String Similarity Metrics,” Proceedings of ACM Conference on Knowledge Discovery and Data Mining, Washington, DC, August 2003, 39-48. Bilenko, M. and Mooney, R. J. (2003b), On Evaluation and Training-Set Construction for Duplicate Detection,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington DC, August 2003. Bilenko, M., Mooney, R., William W. Cohen, Ravikumar, P., and Fienberg, S. (2003), “Adaptive Name Matching in Information Integration,” IEEE Intelligent Systems, 18 (50), 16-23. Bertolazzi, P., De Santis, L., and M. Scannapieco, M. (2003)., “Automatic Record Matching in Cooperative Information Systems, “Proceedings of the Workshop on Data Quality in Cooperative Information Systems, Siena, Italy, January 2003. Borkar, V., Deshmukh, K., and Sunita Sarawagi (2001), “Automatic Segmentation of Text into Structured Records,” Association of Computing Machinery SIGMOD 2001, 175-186. Brefeld, U. and Scheffer, T. (2004), “Co-EM Support Vector Learning,” Proceedings of the 21st International Conference on Machine Learning.

Bruni, R. (2004), “Discrete Models for Data Imputation,” Discrete Applied Mathematics, 144, 59-69.
Bruni, R. (2005), “Error Correction for Massive Datasets,” Optimization Methods and Software, 20 (2-3), 297-316.
(Chaudhuri et al., 2003) ⇒ Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. (2003). “Robust and Efficient Fuzzy Match for Online Data Cleaning.” In: Proceedings of the 2003 [[ACM SIGMOD] Conference (SIGMOD 2003).
Chaudhuri, S., Ganti, V., and Rajeev Motwani (2005), “Robust Identification of Fuzzy Duplicates,” IEEE International Conference on Data Engineering, 865-876.

Christen, P. Churches, T. and Zhu, J.X. (2002). “Probabilistic Name and Address Cleaning and Standardization,” (The Australian Data Mining Workshop, November, 2002), available at http://datamining.anu.edu.au/projects/linkage.html . Churches, T., Christen, P., Lu, J. and Zhu, J. X. (2002), “Preparation of Name and Address Data for Record Linkage Using Hidden Markov Models,” BioMed Central Medical Informatics and Decision Making, 2 (9), available at http://www.biomedcentral.com/1472-6947/2/9/. William W. Cohen, Ravikumar, P., and Fienberg, S. E. (2003a), “A Comparison of String Metrics for Matching Names and Addresses,” International Joint Conference on Artificial Intelligence, Proceedings of the Workshop on Information Integration on the Web, Acapulco, Mexico, August 2003. William W. Cohen, Ravikumar, P., and Fienberg, S. E. (2003b), “A Comparison of String Distance Metrics for Name-Matching Tasks,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington DC, August 2003. William W. Cohen and Richman, J. (2002), “Learning to Match and Cluster Entity Names,” ACM SIGKDD 2002, 475-480. William W. Cohen, and Sunita Sarawagi (2004), “Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods,” Proceedings of the ACM Knowledge Discovery and Data Mining Conference 2005, 89-98. Cooper, W. S. and Maron, M. E. (1978), “Foundations of Probabilistic and Utility-Theoretic Indexing,” Journal of the Association of Computing Machinery, 25, 67-80. Culotta, A., and Andrew McCallum (2005), “Joint Deduplication of Multiple Record Types in Relational Data,” CIKM 2005. Della Pietra, S., Della Pietra, V., and John D. Lafferty (1997), “Inducing Features of Random Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence]], 19, 380-393. Deming, W. E., and Gleser, G. J. (1959), "On the Problem of Matching Lists by Samples," Journal of the American Statistical Association, 54, 403-415. Arthur P. Dempster, Laird, N. M. and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, B, 39, 1-38. Do, H.-H. and Rahm, E. “COMA – A system for flexible combination of schema matching approaches,” Very Large Data Bases 2002, 610-621. Dong, X., Halevy, A., and Madhavan, J. (2005), “Reference Reconciliation in Complex Information Spaces,” Proceedings of the ACM SIGMOD Conference 2005, 85-96. Elfekey, M., Vassilios, V., and Elmagarmid, A. “TAILOR: A Record Linkage Toolbox,” IEEE International Conference on Data Engineering 2002, 17-28 Fan, W., Davidson, I., Zadrozny, B., and Yu, P. (2005), “An Improved Categorization of Classifier’s Sensitivity on Sample Selection Bias, http://www.cs.albany.edu/~davidson/Publications/samplebias.pdf . Fayad, U. and Piatetskey-Shipiro, G, and Padhraic Smyth (1996), “The KDD Process of Extracting Useful Information from Volumes of Data,” Communications of the Association of Computing Machinery, 39 (11), 27-34. Fayad, U. and Uthurusamy, R. (1996), “Data Mining and Knowledge Discovery in Data Bases,” Communications of the Association of Computing Machinery, 39 (11), 24-26. Fayad, U. and Uthurusamy, R. (2002), “Evolving Data Mining into Solutions for Insights,” Communications of the Association of Computing Machinery, 45 (8), 28-31. Fellegi, I. P., and Sunter, A. B. (1969), "A Theory for Record Linkage," Journal of the American Statistical Association, 64, 1183-1210. Ferragina, P. and Grossi, R. (1999), “The String B-tree: a New Data Structure for String Search in External Memory and Its Applications,” Journal of the Association of Computing Machinery, 46 (2), 236-280. Yoav Freund and Robert E. Schapire (1996), “Experiments with a New Boosting Algorithm,” Machine Learning: Proceedings of the Thirteenth International Conference, 148-156. Jerome H. Friedman, Trevor Hastie, Tibshirani, R. (2000), “Additive Logistic Regression: a Statistical View of Boosting,” Annals of Statistics, 28, 337-407. Gill, L. (1999), “OX-LINK: The Oxford Medical Record Linkage System,” in Record Linkage Techniques 1997, Washington, DC: National Academy Press, 15-33. Getoor, L., Friedman, N., Koller, D., and Taskar, B. (2003), “Learning Probabilistic Models for Link Structure,” Journal Machine Learning Research, 3, 679-707. Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, and Srivastava, D. (2001), “Approximate String Joins in a Database (Almost) for Free,” Proceedings of VLDB, 491-500. Guha, S., Koudas, N., Marathe, A., and Srivastava, D. (2004), “Merging the Results of Approximate Match Operations,” Proceedings of the 30th VLDB Conference, 636-647. Hall, P. A. V. and Dowling, G. R.(1980), “Approximate String Comparison,” Association of Computing Machinery, Computing Surveys, 12, 381-402. Trevor Hastie, Tibshirani, R., and Jerome H. Friedman (2001), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer: New York. Hernandez, M. and Stolfo, S. (1995), “The Merge-Purge Problem for Large Databases,” Proceedings of ACM SIGMOD 1995, 127-138. Hjaltson, G. and Samet, H. (2003), “Index-Driven Similarity Search in Metric Spaces,” ACM Transactions On Database Systems, 28 (4), 517-580. Ishikawa, H. (2003), “Exact Optimization of Markov Random Fields with Convex Priors,“ IEEE Transactions on Pattern Analysis and Machine Intelligence]], 25, 1333-1336. Iyengar, V. (2002), “Transforming Data to Satisfy Privacy Constraints,” ACM KDD 2002, 279-288. Jaro, M. A. (1989), "Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida," Journal of the American Statistical Association, 89, 414-420. Jin, L., Li, C., and Mehrortra, S. (2002), “Efficient String Similarity Joins in Large Data Sets,” UCI technical Report, Feb. 2002, http://www.ics.uci.edu/~chenli/pub/strjoin.pdf . Jin, L., Li, C., and Mehrortra, S. (2003), “Efficient Record Linkage in Large Data Sets,” Eighth International Conference for Database Systems for Advance Applications (DASFAA 2003), 26-28 March, 2003, Kyoto, Japan, http://www.ics.uci.edu/~chenli/pub/dasfaa03.pdf . Koller, D. and Pfeffer, A. (1998), “Probabilistic Frame-based Systems,” Proceedings of the Fifteenth National Conference on Artificial Intelligence, 580-587. Koudas, N., Marathe, A., and Srivastava, D. (2004), “Flexible String Matching Against Large Databases in Practice,” Proceedings of the 30th VLDB Conference,1078-1086. John Lafferty, Andrew McCallum, and Fernando Pereira (2001), “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” In: Proceedings of the International Conference on Machine Learning, 282-289. Lahiri, P., and Larsen, M. D. (2000), “Model-based Analysis of Records Linked Using Mixture Models,” American Statistical Association, Proceedings of the Section on Survey Research Methods, 11-19. Lahiri, P. A. and Larsen, M. D. (2005 “Regression Analysis with Linked Data,” Journal of the American Statistical Association, 100, 222-230. Larsen, M. (1999), “Multiple Imputation Analysis of Records Linked Using Mixture Models,” Statistical Society of Canada, Proceedings of the Survey Methods Section, 65-71. Larsen, M. D. (2005), “Hierarchical Bayesian Record Linkage Theory,” Iowa State University, Statistics Department Technical Report. Larsen, M. D., and Rubin, D. B. (2001), “Alterative Automated Record Linkage Using Mixture Models,” Journal of the American Statistical Association, 79, 32-41. Lu, Q. and Getoor, L. (2003), “Link-based Classification,” in (Tom Fawcett and N. Mishra, eds.) Proceedings of the Twentieth International Conference on Machine Learning, 496-503. Malin, B., Sweeney, L., and Newton, E. (2003), “Trail Re-Identification: Learning Who You Are From WhereYou Have Been,” Workshop on Privacy in Data, Carnegie-Mellon University, March 2003. McCallum, A., Bellare, K., and Fernando Pereira (2005), “A Conditional Random Field for Discriminativelytrained Finite-state String Edit Distance,” UAI 2005. McCallum, A., Nigam, K., and Unger, L. H. (2000, “Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching, in Knowledge Discovery and Data Mining, 169-178. McCallum, A. and Wellner, B. (2003), “Object Consolidation by Graph Partitioning with a Conditionally- Trained Distance Metric,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington DC, August 2003. Michalowski, M., Thakkar, S., and Knoblock, C. A. (2003), “Exploiting Secondary Sources for Object Consolidation,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington DC, August 2003. Tom M. Mitchell. M. (1997), Machine Learning, New York, NY: McGraw-Hill. Navarro, G. (2001), “A Guided Tour of Approximate String Matching,” Association of Computing Machinery Computing Surveys, 33, 31-88. Newcombe, H. B. (1988), Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford: Oxford University Press. Newcombe, H. B., Kennedy, J. M. Axford, S. J., and James, A. P. (1959), "Automatic Linkage of Vital Records," Science, 130, 954-959. Newcombe, H.B. and Kennedy, J. M. (1962) "Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information" Communications of the Association for Computing Machinery, 5, 563-567. Newcombe, H. B. and Smith, M. E. (1975), “Methods for Computer Linkage of Hospital Admission- Separation Records into Cumulative Health Histories,” Methods of Information in Medicine, 14 (3), 118-125. Ng, A. and Michael I. Jordan (2002), “On Discriminative vs. Generative classifiers: A comparison of logistic regression and naïve Bayes,” Neural Information Processing Systems 14. Pasula, H., Baskara, M., Milch, B., Russell, S., and Shipster, I. (2003), “Identity Uncertainty and Citation Matching,” NIPS 03. Pasula, H. and Russell, S. (2001), “Approximate Inference for First-Order Probabilistic Languages,” Proceedings of the International Joint Conference on Artificial Intelligence, 741-748. Pasula, H., Russell, S., Ostland, M., and Ritov, Y. (1999), “Tracking Many Objects with Many Sensors,” Proceedings of the Joint International Conference on Artificial Intelligence. Pollock, J. and Zamora, A. (1984), "Automatic Spelling Correction in Scientific and Scholarly Text," Communications of the ACM, 27, 358-368. Porter, E. H., and Winkler, W. E. (1999), “Approximate String Comparison and its Effect in an Advanced Record Linkage System,” in Alvey and Jamerson (ed.) Record Linkage Techniques - 1997, 190-199, National Research Council, National Academy Press: Washington, D.C.

Rahm, E. and Do, H.-H. (2000), “Data Cleaning: Problems and Currrent Approaches,” IEEE Bulletin on Data Engineering, 23 (4).
Ravikumar, P. and William W. Cohen (2004), “A Hierarchical Graphical Model for Record Linkage,” Proceedings of the Conference on Uncertainty in Artificial Intelligence, Banff, Calgary, CA, July 2004.
P. Ravikumar, and John D. Lafferty. (2004). “Variational Chernoff Bounds for Graphical Models.” In: Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Ristad, E. S., and Yianilos, P. (1998), “Learning String-Edit Distance,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 522-531.

Russell, S. (2001), “Identity Uncertainty,” Proceedings of IFSA-01, http:/www.cs.berkeley.edu/~russell/papers/ifsai01-identity.ps . Sunita Sarawagi and Bhamidipaty, A. (2002), “Interactive Deduplication Using Active Learning, Very Large Data Bases 2002, 269-278. Scannipieco, M. (2003), “DAQUINCIS: Exchanging and Improving Data Quality in Cooperative Information Systems,” Ph.D. thesis in Computer Engineering, University of Rome “La Sapienza.”

Scheuren, F., and Winkler, W. E. (1993), "Regression analysis of data files that are computer matched," Survey Methodology, 19, 39-58.
Scheuren, F., and Winkler, W. E. (1997), "Regression analysis of data files that are computer matched, II," Survey Methodology, 23, 157-165.
(Sebastiani, 2002) ⇒ Fabrizio Sebastiani. (2002). “Machine Learning in Automated Text Categorization.” In: Association of Computing Machinery Computing Surveys, 34 (1), 1-47.
Sekar, C. C., and Deming, W. E. (1949), "On a Method of Estimating Birth and Death Rates and the Extent of Registration," Journal of the American Statistical Association, 44, 101-115.
Steel, P., and Konshnik, C. (1999), “Post-Matching Adminstrative Record Linkage Between Sole Proprietorship Tax Returns and the Standard Statistical Establishment List,” in Record Linkage Techniques 1997, Washington, DC: National Academy Press, 179-189.
Sweeney, L. (1999), “Computational Disclosure Control for Medical Microdata: The Datafly System” in

Record Linkage Techniques 1997, Washington, DC: National Academy Press, 442-453. Taskar, B., Abdeel, P., and Koller, D. (2002), “Discriminative Probabilistic Models for Relational Data,” Proceedings of the Conference on Uncertainty in Artificial Intelligence. Taskar, B., Segal, E., and Koller, D. (2001), Probabilistic Classification and Clustering in Relational Data,” Proceedings of the International Joint Conference on Artificial Intelligence. Taskar, B., Wong, M. F., Abdeel, P. and Koller, D. (2003), “Link Prediction in Relational Data,” Neural Information Processing Systems, to appear. Taskar, B., Wong, M. F., and Koller, D. (2003), “Learning on Test Data: Leveraging “Unseen” Features,” Proceedings of the Twentieth International Conference on Machine Learning, 744-751. Tejada, S., Knoblock, C., and Minton, S. (2001), “Learning Object Identification Rules for Information Extraction,” Information Systems, 26 (8), 607-633. Tejada, S., Knoblock, C., and Minton, S. (2002), “Learning Domain-Independent String Transformation for High Accuracy Object Identification, Proceedings of ACM SIGKDD ’02.

Torra, V. (2004), “OWA Operators in Data Modeling and Re-Identification,” IEEE Transactions on Fuzzy Systems, 12 (5) 652-660.
Vladimir N. Vapnik. (2000). “The Nature of Statistical Learning Theory (2nd Edition). Springer-Verlag.
Wang, S., Schuurmans, D., Peng, F., and Zhao, Y. (2003), “Learning Mixture Models with the Latent Maximum Entropy Principal,” in (Tom Fawcett and N. Mishra, eds.) Proceedings of the Twentieth International Conference on Machine Learning, 776-783 (also version for IEEE Transactions on Neural Nets in 2004).
Wei, J. (2004), Markov Edit Distance, IEEE Transactions of Pattern Analysis and Machine Intelligence]], 26 (3), 311-321.
Winkler, W. E. (1988), "Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, 667-671.

Winkler, W. E. (1989a), "Near Automatic Weight Computation in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Fifth Census Bureau Annual Research Conference, 145-155. Winkler, W. E. (1989b), "Methods for Adjusting for Lack of Independence in an Application of the Fellegi-Sunter Model of Record Linkage," Survey Methodology, 15, 101-117. Winkler, W. E. (1989c), "Frequency-based Matching in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, 778-783. Winkler, W. E. (1990a), "String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association., 354-359. Winkler, W. E. (1990b), “On Dykstra’s Iterative Fitting Procedure,” Annals of Probability, 18, 1410-1415. Winkler, W. E. (1993a) "Business Name Parsing and Standardization Software," unpublished report, Washington, DC: Statistical Research Division, U.S. Bureau of the Census. Winkler, W. E. (1993b), "Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, 274-279. Winkler, W. E. (1994), "Advanced Methods for Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, 467-472 (longer version report 94/05 available at http://www.census.gov/srd/www/byyear.html). Winkler, W. E. (1995), "Matching and Record Linkage," in B. G. Cox et al. (ed.) Business Survey Methods, New York: J. Wiley, 355-384 (also available at http://www.fcsm.gov/workingpapers/ wwinkler.pdf). Winkler, W. E. (1998). “Re-identification Methods for Evaluating the Confidentiality of Analytically Valid Microdata,” Research in Official Statistics, 1, 87-104. Winkler, W. E. (1999a). “The State of Record Linkage and Current Research Problems," Statistical Society of Canada, Proceedings of the Survey Methods Section, 73-80 (longer version at http://www.census.gov/srd/www/byyear.html). Winkler, W. E. (1999b), “Issues with Linking Files and Performing Analyses on the Merged Files,” Proceedings of the Sections on Government Statistics and Social Statistics, American Statistical Association, 262-265. Winkler, W. E. (1999c), "Record Linkage Software and Methods for Administrative Lists," Eurostat, Proceedings of the Exchange of Technology and Know-How '99, also available at http://www.census.gov/srd/www/byyear.html. Winkler, W. E. (2000), “Machine Learning, Information Retrieval, and Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, 20-29. (also available at http://www.niss.org/affiliates/dqworkshop/papers/winkler.pdf). Winkler, W. E. (2001), “The Quality of Very Large Databases,” Proceedings of Quality in Official Statistics ‘2001, CD-ROM (also available at http://www.census.gov/srd/www/byyear.html as report rr01/04). Winkler, W. E. (2002), “Record Linkage and Bayesian Networks,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also at http://www.census.gov/srd/www/byyear.html). Winkler, W. E. (2003a), “Methods for Evaluating and Creating Data Quality,” Proceedings of the Workshop on Cooperative Information Systems, Sienna, Italy, January 2003, longer version in Information Systems (2004), 29 (7), 531-550. Winkler, W. E. (2003b), “Data Cleaning Methods,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington, DC, August 2003. Winkler, W.E. (2004a), “Re-identification Methods for Masked Microdata,” in (J. Domingo-Ferrer and V. Torra, eds.) Privacy in Statistical Databases 2004, New York: Springer, 216-230, http://www.census.gov/srd/papers/pdf/rrs2004-03.pdf . Winkler, W.E. (2004b), “Masking and Re-identification Methods for Public-Use Microdata: Overview and Research Problems,” in (J. Domingo-Ferrer and V. Torra, eds.) Privacy in Statistical Databases 2004, New York: Springer, 231-247, http://www.census.gov/srd/papers/pdf/rrs2004-06.pdf . Winkler, W. E. (2004c), “Approximate String Comparator Search Strategies for Very Large Administrative Lists,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also available as report 2005/02 at http://www.census.gov/srd/www/byyear.html). Winkler, W. E. (2006), “Sample Allocation and Stratification,” in (P.S.R.S. Rao and M. Katzoff) Handbook on Sampling Techniques and Analysis, CRC Press: London, UK. Winker, W. E., and Yancey, W. E. (2006), “Record Linkage Error-Rate Estimation without Training Data,” Proceedings of the Section on Survey Research Methods, American Statistical Association, to appear. Winker, W. E., and Yancey, W. E. (2006), “Parallel BigMatch,” technical report, to appear. Yancey, W.E. (2000), “Frequency-Dependent Probability Measures for Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, 752-757 (also at http://www.census.gov/srd/www/byyear.html). Yancey, W.E. (2002), “Improving EM Parameter Estimates for Record Linkage Parameters,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also at http://www.census.gov/srd/www/byyear.html as report RRS 2004/01). Yancey, W.E. (2003), “An Adaptive String Comparator for Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also report RRS 2004/02 at http://www.census.gov/srd/www/byyear.html). Yancey, W.E. (2004), “The BigMatch Program for Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM. Yancey, W.E. (2005), “Evaluating String Comparator Performance for Record Linkage,” research report RRS 2005/05 at http://www.census.gov/srd/www/byyear.html). Yancey, W.E., and Winkler, W. E. (2004), “BigMatch Software,” computer system, documentation available at http://www.census.gov/srd/www/byyear.html as report RRC 2002/01). Yancey, W.E., Winkler, W.E., and Creecy, R. H. (2002). “Disclosure Risk Assessment in Perturbative Microdata Protection,” in (J. Domingo-Ferrer, ed.) Inference Control in Statistical Databases, Springer: New York. Zadrozny, B. (2004), “Learning and Evaluating Classifiers under Sample Selection Bias,” Proceedings of the International Conference on Machine Learning, . Zhang, T. and Oles, F. (2001) “Text Categorization Based on Regularized Linear Classification Methods,” Information Retrieval, 4 (1), 5-31.

,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2006 OverviewOfRecordLinkageAndCurrentResDirs	William E. Winkler			Overview of record linkage and current research directions			http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf			2006