2004 IterativeRecordLinkageForCleaningAndIntegration

From GM-RKB

Subject Headings: Entity Record Deduplication Algorithm, Link Mining, Distance Function, Clustering Algorithm, Person Record Deduplication Task.

Notes

Cited By

2008

2007

Quotes

Abstract

Record linkage, the problem of determining when two records refer to the same entity, has applications for both data cleaning (deduplication) and for integrating data from multiple sources. Traditional approaches use a similarity measure that compares tuples' attribute values; tuples with similarity scores above a certain threshold are declared to be matches. While this method can perform quite well in many domains, particularly domains where there is not a large amount of noise in the data, in some domains looking only at tuple values is not enough. By also examining the context of the tuple, i.e. the other tuples to which it is linked, we can come up with a more accurate linkage decision. But this additional accuracy comes at a price. In order to correctly find all duplicates, we may need to make multiple passes over the data; as linkages are discovered, they may in turn allow us to discover additional linkages. We present results that illustrate the power and feasibility of making use of joint information when comparing records.

1. Introduction

One of the fundamental problems in data cleaning and information integration is determining when two tuples refer to the same real-world entity. Oftentimes data contains errors, for example typographical errors, or data may have multiple representations, such as abbreviations, so an exact comparison does not suffice for detecting duplicates in those cases. In data cleaning, deduplication [12, 18] is important both for accurate analysis, for example determining the number of customers, and for cost-effectiveness, for example removing duplicates from a direct mailing list. In information integration, determining approximate joins [5] is important for consolidating information from multiple sources; most often there will not be a unique key that can be used to join tables in distributed databases, and we must infer when two records from different databases, possibly with different structures, refer to the same entity. Traditional approaches to duplicate detection are based on approximate string matching criteria, in some cases augmented with domain-specific rules. More recently, there have been adaptive approaches which make use of multiple attributes and use labeled data [24, 2, 1, 3].

Our approach also makes use of attribute similarity measures, but in addition, it takes into account the similarity of linked objects. For example, if we are comparing two census records for `Jon Doe' and `Jonathan Doe', we should be more likely to match them if they are both married to `Jeannette Doe' and they both have dependents `James Doe', `Jason Doe' and `June Doe'. In other words, the string similarity of the attributes is taken into account, but so too is the similarity of the people to whom the person is related. This is similar in spirit to the approach taken by [1, 3]. However in our case, we do not assume that the linked objects have already been deduplicated. In fact, in our case the links are among objects of the same type, so that when we determine that two records refer to the same individual, that may in turn allow us to make additional inferences. In other words, the deduplication process is iterative.

Making the process iterative has the potential benefit that we may produce more accurate results; in particular, as noted by [1], we may be able to decrease our false positive rate because we can set our string match threshold more conservatively. But it has the down-side that the process is more expensive computationally; first, as we go beyond simply comparing attributes to comparing references, the similarity computation becomes more expensive and second, as we iterate, we must continue to update the distances as new duplicates are detected.

In this paper, we formulate the problem of iterative deduplication and present efficient algorithms. Since typically we do not have ground truth to compare against, evaluation of the results of deduplication is difficult. Here we use a parameterized data generator and present a thorough investigation of our iterative deduplication algorithm, showing the performance for a range of parameter settings for both the generator and the deduplication algorithm.

2. Related Work

There has been a large body of work on deduplication, record linkage, and co-reference detection.[1] Here we review some of the main work, but the review is not exhaustive. For summary reports on deduplication and record linkage, see [28, 11, 4].

Within the statistics community, the earliest work was done by Newcombe [21]. He describes methods for limiting the number of comparisons required. Fellegi and Sunter [9] define a statistical framework for predicting "match" and "not match". Roughly speaking, record linkage is viewed as classification of pairs as "matched" and "unmatched", based on certain computed features of the pairs. Fellegi and Sunter describe how to estimate the parameters of their model and how to use it for classification. More recently, Winkler [26, 27, 29] builds upon the work by Fellegi and Sunter and uses a probabilistic approach with a latent match variable which is estimated using EM.

There has been extensive work on defining approximate string matching algorithms [18, 19, 20, 7] and adaptive algorithms, for example algorithms that learn string similarity measures [23, 6, 2, 8] and use active learning [24]. An important focus is on efficient data cleaning; examples include Hernández and Stolfo [12] and Monge and Elkan [18]. Another area of related work is the work on identity uncertainty [22], object identification [25] and co-reference resolution in natural language processing. The work on co-reference resolution [15] is typically done with unstructured text; here our focus is on structured data.

One of the domains commonly used as a testbed is the citation domain [13, 16, 22, 24]. This is also the domain that we use as motivation, however as we will see, we focus on identifying authors rather than papers. The work most closely related to ours is the work of Chaudhuri et al. [3, 1]. They also make use of join information to aid in deduplication; as mentioned in the introduction, a key difference is that they assume that the secondary tables are themselves duplicate-free. Ananthakrishna et al. [1] use co-occurrence in dimensional hierarchies to identify duplicates. Specifically, to resolve whether two entities are duplicates, they check for co-occurrence in the children sets of the entities. We work in a more general setting where there is no hierarchy within the groups/tuples and use the entire group to check for co-occurrences. As a result, instead of the easier problem of having to match sets, we are faced with the problem of matching sets of sets. Also, we use an iterative framework for our algorithm, motivated by our recursive definition of duplicates.

  • 1. The term deduplication is used more commonly within the database community, record linkage is the term used by statisticians, and co-reference resolution is the term used more commonly by the AI and natural language processing community.

3. Motivating Example

Consider the problem of trying to construct a database of papers, authors and citations, from a collection of paper references, perhaps collected by crawling the web. A well-known example of such a system is CiteSeer [10], an autonomous citation indexing engine. CiteSeer is an important resource for CS researchers, and makes searching for electronic versions of papers easier. However, as anyone who has used CiteSeer can attest, there are often multiple references to the same paper, citations are not always resolved and authors are not always correctly identified [24, 22].

A related, and even more difficult, problem is the task of integrating several citation services. While multiple references are not as much of an issue for a curated citation index such as DBLP [14], determining when entries from different sources refer to the same entity is still a problem.

In the context of our motivating example, there are several kinds of references that must be resolved. The first is the paper resolution problem; this is the most commonly studied bibliographic entity resolution task. The second is the author resolution problem; this task is less commonly studied, and is the focus of this paper.

3.1 The paper resolution problem

Consider the following example from [24]:

  • Rakesh Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB-94, 1994.
  • Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

Sometimes, the paper resolution can be done based simply on the title. We can use one of the many existing methods for string matching, perhaps even tuned to the task of title matching. There is additional relational information, in terms of the venue, the authors of the paper, and the citations made by the paper; this additional information may help add evidence that the two references refer to the same paper. This type of entity resolution has been the focus of much of the work in citation matching [13, 16, 22, 24].
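
To make the title-matching step concrete, here is a minimal sketch (not the system described in the paper) that scores the two title strings above with token-set Jaccard similarity; the tokenization and the suggestion of a threshold are illustrative assumptions.

```python
import re

def title_tokens(title):
    """Lower-case a title and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def title_similarity(t1, t2):
    """Jaccard similarity between the token sets of two titles."""
    a, b = title_tokens(t1), title_tokens(t2)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# The two VLDB-94 references above share most of their title tokens,
# so even a modest threshold would already flag them as candidates.
t1 = "Fast algorithms for mining association rules in large databases"
t2 = "Fast Algorithms for Mining Association Rules"
print(round(title_similarity(t1, t2), 2))  # 0.67
```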

3.2 The author resolution problem

A more novel example of entity resolution is the case of author resolution. Suppose that we have two different papers, and we are trying to determine if there are any authors in common between them. We can also do a string similarity match between the author names, but often references to the same person vary significantly. The most common difference is the variety of ways in which the first name and middle name are specified. For an author entity "Jeffrey David Ullman", we may see references "J. D. Ullman", "Jeff Ullman", "Ullman, J. D.", and so on. For the most part, these types of transformations can be handled by specialized code that checks for common name presentation transforms. However, we are still presented with the dilemma of determining whether a first name or middle name is the same as some initial; while the case of matching "J. D. Ullman" and "Jeffrey D. Ullman" seems quite obvious, for common names such as "J. Smith" and "X. Wang" the problem is more difficult. Existing systems take name frequency into account and will give unusual names higher matching scores. But this still leaves the problem of determining when two references to "J. Smith" refer to the same individual.
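
The paragraph above mentions specialized code that handles common name presentation transforms. The following is a minimal sketch of what such a compatibility check might look like; the parsing rules and helper names are illustrative assumptions, not the paper's implementation.

```python
def name_parts(name):
    """Split an author string into (given-name tokens, surname).
    Handles both 'Ullman, J. D.' and 'J. D. Ullman' orderings."""
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
        tokens = first.split()
    else:
        tokens = name.split()
        last = tokens.pop()
    return [t.rstrip(".") for t in tokens], last

def names_compatible(n1, n2):
    """True if the surnames match exactly and each aligned pair of
    given-name tokens agrees; an initial is treated as compatible with
    any full name that starts with the same letter."""
    g1, s1 = name_parts(n1)
    g2, s2 = name_parts(n2)
    if s1.lower() != s2.lower():
        return False
    for a, b in zip(g1, g2):
        a, b = a.lower(), b.lower()
        if len(a) == 1 or len(b) == 1:
            if a[0] != b[0]:
                return False
        elif a != b:
            return False
    return True

print(names_compatible("J. D. Ullman", "Jeffrey David Ullman"))  # True
print(names_compatible("J. Smith", "Jane Smith"))  # True, but still ambiguous in practice
```

As the last example shows, compatibility alone cannot decide whether two "J. Smith" references are the same person, which is exactly the gap the co-author context is meant to fill.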

We propose to make use of additional context information in the form of coauthor relationships. If the coauthors of "J. Smith" for these two papers are the same, then we should take this into account, and give the two references a higher matching score. But in order to do this we must have already determined that the other two author references refer to the same individual; thus it becomes a chicken and egg problem.

Consider the example shown in Figure 1, where we have four paper references, each with a title and author references. Figure 2 shows the final result after all the author references have been correctly resolved. We begin by examining the author references to see which ones we consider to be the same. In the first step, we might decide that all of the Aho references refer to the same individual, because Aho is an unusual last name. This corresponds to identifying r1, r4, r6 and r8 as duplicates. However, based on this name information alone, we are not quite sure whether the Ullman references (r3, r5, r7 and r10) are to the same individual, and we are certainly not sure about the Johnson references (r2 and r9). But having decided that the Aho references are the same gives us additional information for the Ullman references. With high confidence we consolidate the r5 and r7 Ullman references; we may also consolidate the other Ullman references, although we may not be as certain that this is correct. And, at this stage, based solely on the Aho entity consolidation, we may decide we do not have enough evidence to consolidate the Johnson references. However, after having made the Ullman consolidations, we may decide that having two common coauthors for the two references is enough evidence to tip the balance, and we decide that they refer to the same Johnson.

Thus the problem of author resolution is likely to be an iterative process; as we identify common authors, this will allow us to identify additional potential co-references. We can continue in this fashion until all of the entities have been resolved.

4. Formulation of the Entity Resolution Problem

In the entity resolution problem, we have some collection of references to objects and from this set of references we would like to identify the (unique, minimal) collection of individuals or entities to which they should be mapped. In other words, we would like to find a many-to-one mapping from references to entities. But the problem is that we don't know the set of entities. Because references themselves may be many-to-many, such as the authorship relationship between papers and authors, we may be in the sticky situation of trying to identify more than one class of entities at the same time. Here, for simplicity, we consider the case where we have a single entity class to consolidate.

In the single entity consolidation problem, we have a collection of references R = {r1, r2, …, rn}. Each reference r corresponds to a unique entity, E(r) ∈ {e1, e2, …, ek}, and conversely R(e) = {ri | E(ri) = e}. The references may be partitioned into groups G = {g1, g2, …, gm}; a reference appears in only one group. Our task is, given R and G, to correctly determine both the entities (including the number of entities k) and the mapping from references to entities. To make this more concrete, consider our earlier citation example. Figure 2 shows the correspondence between objects and variables. The references correspond to authorship relationships, such as (J. D. Ullman, Paper1). Let's call this r3. Then E(r3) = e2, where e2 corresponds to the entity "Jeffrey David Ullman". The groups correspond to the set of author references for each of the papers. We assume that there is one group per paper entity and that there is a one-to-one mapping between groups and papers. For example, in our earlier example, paper1 defines the group g1 = {r1, r2, r3}, where r1 = A. V. Aho, r2 = S. C. Johnson, and r3 = J. D. Ullman. Note that this is a group of references, not entities. Once we have correctly constructed the entity set for the authors, then we will be able to map from a paper to the actual authors of the paper. Thus, we want to learn that the entity sets are e1, e2 and e3. The references belonging to e1 are {r1, r4, r6, r8}, to e2 are {r3, r5, r7, r10}, and to e3 are {r2, r9}.
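
For concreteness, here is a small sketch of how this running example could be encoded. The composition of groups g2–g4 is not spelled out in the excerpt above, so those assignments are illustrative placeholders; the entity partition follows the text.

```python
# Groups: one per paper. g1 is given in the text; g2-g4 are hypothetical.
groups = {
    "g1": ["r1", "r2", "r3"],
    "g2": ["r4", "r5"],          # illustrative composition
    "g3": ["r6", "r7"],          # illustrative composition
    "g4": ["r8", "r9", "r10"],   # illustrative composition
}

# The target output: the entities and the mapping E from references to entities.
entities = {
    "e1": {"r1", "r4", "r6", "r8"},    # Alfred V. Aho
    "e2": {"r3", "r5", "r7", "r10"},   # Jeffrey D. Ullman
    "e3": {"r2", "r9"},                # S. C. Johnson
}
E = {r: e for e, refs in entities.items() for r in refs}
assert E["r3"] == "e2"
```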

Not surprisingly, we do this by clustering references that are similar to each other. The key to the success of this clustering algorithm is the similarity measure (or, equivalently, the distance measure). We define a distance measure that takes into account both the attributes of the objects and the links between objects.

The attribute similarity of two references measures the similarity of their attribute values. For references ri and rj, it measures the similarity between the two author names. If there are other attributes known, such as the authors' institutions, then these can be factored in as well. There has been significant work on computing attribute similarity, and several packages that implement such measures, such as [17], are available, so we assume that this measure is given.

In addition, to compute the similarity of the relationships that two objects participate in, we must compare their groups. But if we simply compare whether the references are the same, then we will not find any overlap; as we defined them, references correspond to authorship relations. Instead, we are interested in the authors themselves. But these are the entities that we are trying to find. Rather, we compare two groups by looking at which references are currently known or believed to be duplicates. So clearly, this notion of similarity is bound to the current set of duplicates and will change as new duplicates are found. In the following section, we describe an iterative clustering algorithm that leverages this dynamic nature of the similarity.
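
One reasonable way to instantiate this idea (an assumption for illustration, not necessarily the paper's exact group-comparison measure) is to map each reference in a group to its current duplicate cluster and compare the resulting label sets, for example with Jaccard similarity:

```python
def group_similarity(group_i, group_j, cluster_of, exclude=()):
    """Compare two groups of references through the clusters (current
    duplicate sets) that their references belong to.  The two references
    being compared can be excluded so the score reflects only the
    co-authors.  Returns a Jaccard score in [0, 1]."""
    ci = {cluster_of[r] for r in group_i if r not in exclude}
    cj = {cluster_of[r] for r in group_j if r not in exclude}
    if not ci or not cj:
        return 0.0
    return len(ci & cj) / len(ci | cj)

# After the Aho references have been merged into one cluster "c_aho",
# the groups of two Ullman references share a co-author cluster.
cluster_of = {"r1": "c_aho", "r4": "c_aho", "r2": "c2", "r3": "c3", "r5": "c5"}
print(group_similarity(["r1", "r2", "r3"], ["r4", "r5"], cluster_of,
                       exclude=("r3", "r5")))   # 0.5: one shared co-author cluster
```

Because the score depends on `cluster_of`, it changes as new duplicates are found, which is exactly what makes the process iterative.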

5. Iterative Deduplication

Our definition of duplicates is based on a distance measure between references.

Algorithm

Input
  • n refs. {r1,r2,…,rn} to k entities {e1,e2,…,ek}, where:
  • n > [math]\displaystyle{ k }[/math], and R(e) = {ri | E(ri) = e}
  • m links {l1,l2,…,lm}, where:
  • Each reference r appears only in one link.
  • Each link l is a collection of references.
Task
  • Determine entities {e1,e2,…,ek}, including quantity k.
  • The mapping of references to entities, E(r).
Approach
  • Iteratively cluster ‘similar’ references into entities
Link Distance
  • d(ri, rj) is the distance between two references.
  • [math]\displaystyle{ d(r_i, r_j) = (1-\alpha)\, d_{attr}(r_i, r_j) + \alpha\, d_{link}(L(r_i), L(r_j)) }[/math]
  • dattr(ri, rj) is the distance between the attributes of the two references.
  • dlink(Li, Lj) is the distance between the two links.
  • α is a weighting of the two distances.
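
Putting the pieces together, the sketch below combines an attribute distance and a link distance with the weighting above and greedily merges clusters until no pair of references falls below a threshold. The specific choices of dattr (edit-ratio based), dlink (Jaccard over current clusters), the threshold, and the greedy merge order are illustrative assumptions rather than the paper's exact algorithm.

```python
from difflib import SequenceMatcher

def d_attr(name_i, name_j):
    """Illustrative attribute distance: 1 - normalized string similarity."""
    return 1.0 - SequenceMatcher(None, name_i.lower(), name_j.lower()).ratio()

def d_link(link_i, link_j, cluster_of):
    """Illustrative link distance: 1 - Jaccard similarity between the
    current clusters of the references in the two links."""
    ci = {cluster_of[r] for r in link_i}
    cj = {cluster_of[r] for r in link_j}
    return 1.0 - (len(ci & cj) / len(ci | cj) if (ci | cj) else 0.0)

def iterative_dedup(names, link_of, alpha=0.5, threshold=0.4):
    """Greedy iterative clustering: merge the clusters of any reference pair
    whose combined distance d = (1-alpha)*d_attr + alpha*d_link falls below
    the threshold, and keep re-scanning because each merge can lower the
    link distance of other pairs."""
    cluster_of = {r: r for r in names}            # start with singleton clusters
    merged = True
    while merged:
        merged = False
        refs = list(names)
        for i, ri in enumerate(refs):
            for rj in refs[i + 1:]:
                if cluster_of[ri] == cluster_of[rj]:
                    continue
                d = ((1 - alpha) * d_attr(names[ri], names[rj])
                     + alpha * d_link(link_of[ri], link_of[rj], cluster_of))
                if d < threshold:
                    old, new = cluster_of[rj], cluster_of[ri]
                    for r in cluster_of:          # relabel the merged cluster
                        if cluster_of[r] == old:
                            cluster_of[r] = new
                    merged = True
    return cluster_of
```

Here `names` maps each reference to its author string and `link_of` maps each reference to the list of references in its group (paper), following the encoding sketched in Section 4; both names are hypothetical.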

References

  • 6 W. Cohen and J. Richman. Learning to match and cluster entity names. In ACM SIGIR-2001 Workshop on Mathematical/Formal Methods in Information Retrieval, New Orleans, LA, Sept. 2001.
  • 7 William W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In: Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pages 73--78, Acapulco, Mexico, Aug. 2003.
  • 8 William W. Cohen, Jacob Richman, Learning to match and cluster large high-dimensional data sets for data integration, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada doi:10.1145/775047.775116
  • 9 I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.
  • 10 C. Lee Giles, Kurt D. Bollacker, Steve Lawrence, CiteSeer: an automatic citation indexing system, Proceedings of the third ACM conference on Digital libraries, p.89-98, June 23-26, 1998, Pittsburgh, Pennsylvania, United States doi:10.1145/276675.276685
  • 11 L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia, April 2003.
  • 12 Mauricio A. Hernández, Salvatore J. Stolfo, The merge/purge problem for large databases, Proceedings of the 1995 ACM SIGMOD Conference, p.127-138, May 22-25, 1995, San Jose, California, United States
  • 13 J. A. Hylton. Identifying and merging related bibliographic records. Master's thesis, Department of Electrical Engineering and Computer Science, MIT, 1996.
  • 14 Michael Ley, The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, p.1-10, September 11-13, 2002
  • 15 Andrew McCallum and Ben Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In: Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pages 79--86, Acapulco, Mexico, Aug. 2003.
  • 16 Andrew McCallum, Kamal Nigam, Lyle H. Ungar, Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching, Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p.169-178, August 20-23, 2000, Boston, Massachusetts, United States doi:10.1145/347090.347123
  • 17 M. G. Elfeky, V. S. Verykios, and A. K. Elmagarmid. TAILOR: A record linkage tool box. In: Proceedings of the 18th International Conference on Data Engineering, p.17, February 26-March 1, 2002.
  • 18 Alvaro E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 267--270, Portland, OR, August 1996.
  • 19 Alvaro E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23--29, Tucson, AZ, May 1997.
  • 20 Gonzalo Navarro, A guided tour to approximate string matching, ACM Computing Surveys (CSUR), v.33 n.1, p.31-88, March 2001 doi:10.1145/375360.375365
  • 21 Howard B. Newcombe, J. Kennedy, S. Axford, and A. James. Automatic linkage of vital records. Science, 130:954--959, 1959.
  • 22 H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.
  • 25 Sheila Tejada, Craig A. Knoblock, Steven Minton, Learning object identification rules for information integration, Information Systems, v.26 n.8, p.607-633, December 2001 doi:10.1016/S0306-4379(01)00042-4
  • 26 W. E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association, pages 354--359, 1990.
  • 27 W. E. Winkler. Improved decision rules in the Fellegi-Sunter model of record linkage. Technical report, Statistical Research Division, U.S. Census Bureau, Washington, DC, 1993.
  • 28 W. E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau, Washington, DC, 1999.
  • 29 W. E. Winkler. Methods for record linkage and Bayesian networks. Technical report, Statistical Research Division, U.S. Census Bureau, Washington, DC, 2002.


Citation metadata:
  • Author(s): Indrajit Bhattacharya, Lise Getoor
  • Title: Iterative Record Linkage for Cleaning and Integration
  • Venue: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery
  • URL: http://cgis.cs.umd.edu/~indrajit/DOCS/dmkd04.pdf
  • DOI: 10.1145/1008694.1008697
  • Year: 2004