2008 SwooshAGenericApproachToEntityResolution

Subject Headings: Record Coreference Resolution Algorithm, Pairwise Record Coreference Resolution Algorithm, Data Cleaning, SERF System.

Notes

  • see also
    • Omar Benjelloun, Hector Garcia-Molina, Hideki Kawai, Tait Eliott Larson, David Menestrina, Qi Su, Sutthipong Thavisomboon, and Jennifer Widom. (2006). “Generic Entity Resolution in the SERF Project.” In: IEEE Data Engineering Bulletin, June 2006.

Cited By

Quotes

Abstract

We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the 4 properties. F-Swoosh in addition assumes knowledge of the “features” (e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an “approximate” result is acceptable.

Introduction

Entity Resolution (ER) (sometimes referred to as deduplication) is the process of identifying and merging records judged to represent the same real-world entity. ER is a well-known problem that arises in many applications. For example, mailing lists may contain multiple entries representing the same physical address, but each record may be slightly different, e.g., containing different spellings or missing some information. As a second example, consider a company that has different customer databases (e.g., one for each subsidiary), and would like to consolidate them. Identifying matching records is challenging because there are no unique identifiers across databases. A given customer may appear in different ways in each database, and there is a fair amount of guesswork in determining which customers match.

Deciding if records match is often computationally expensive and application specific. For instance, a customer information management solution from a company we have been interacting with uses a combination of nickname algorithms, edit distance algorithms, fuzzy logic algorithms, and trainable engines to match customer records. On the latest hardware, the speed of matching records ranges from 10M to 100M comparisons per hour (single threaded), depending on the parsing and data cleansing options executed. A record comparison can thus take up to about 0.36 ms, greatly exceeding the runtime of any simple string/numeric value comparison. How to match and combine records is also application specific. For instance, the functions used by that company to match customers are different from those used by others to match, say, products or DNA sequences.
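To spell out the arithmetic behind the 0.36 ms figure: at the slower rate of 10M comparisons per hour,

    \frac{3600\ \text{s/hour}}{10^{7}\ \text{comparisons/hour}} = 3.6 \times 10^{-4}\ \text{s} \approx 0.36\ \text{ms per comparison}.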

In this paper we take a “generic approach” for solving ER, i.e., we do not study the internal details of the functions used to compare and merge records. Rather, we view these functions as “black-boxes” to be invoked by the ER engine. (Incidentally, there has been a lot of work done on the design of effective comparison and merge functions; see Section 6.) Given such black-boxes, we study algorithms for efficiently performing ER, i.e., we develop strategies that minimize the number of invocations to these potentially expensive black-boxes. In a way, our work is analogous to the design of efficient join algorithms, except that the operator we study is the ER operator. An important component of our work is that we identify a set of properties that, if satisfied by the match and merge functions, lead to significantly more efficient ER. For example, associativity of merges is one such important property: if merges are not associative, the order in which records are merged may impact the final result. Another notable feature is that we do not perform the matching and merging separately, but tightly integrate them into a single process.
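To make the black-box view concrete, here is a minimal Python sketch of the only interface such a generic engine relies on; the type-alias names and the invocation counter are illustrative, not from the paper:

    from typing import Callable, TypeVar

    Record = TypeVar('Record')  # records are opaque to the generic engine

    # The engine may only invoke a pairwise Boolean match function and a
    # pairwise merge function; all application logic hides behind them.
    MatchFn = Callable[[Record, Record], bool]
    MergeFn = Callable[[Record, Record], Record]

    def count_calls(black_box):
        """Wrap a black-box to count invocations -- the cost measure that
        the Swoosh algorithms are designed to minimize."""
        def wrapped(*args):
            wrapped.calls += 1
            return black_box(*args)
        wrapped.calls = 0
        return wrapped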

In this paper we focus on “pairwise ER,” a common way to resolve records in the commercial world. In particular, the following assumptions are made:

  • Pairwise decisions. Our black-box functions to match and merge records operate on two records at a time. Their operation depends solely on the data in these records, and not on the evidence in other records. In general, it is easier for application specialists to write pairwise record comparison and merge functions, as opposed to, say, functions that determine when a group of records may represent the same entity. Note that this requirement need only hold at the time ER is performed, and does not preclude a prior training phase that considers the whole dataset, or a representative sample. (For example, a first phase can compute term frequencies for say all product descriptions, and the frequencies can then be used in comparing pairs of descriptions.) Thus, approaches based on machine learning can be leveraged to match or merge records.
  • No confidences. We do not work with numeric similarity values or confidences. Record comparison functions may indeed compute numeric similarities (e.g., how close is this name to that name), but in the end they make a yes-no decision as to whether records match. Carrying confidences in the ER computations could in principle lead to more accurate decisions, but complicates processing significantly. For instance, one must decide how to combine confidences when records are merged. Also, confidences may decrease upon merges, which makes it more challenging to compare the information in merged records to that of base records. In a technical report [26], we study generic ER with confidences, in an extension of the framework presented here, where confidences are also handled by the black-box match and merge functions.
  • No relationships. In our model, records contain all the information that pertains to each entity (see Figure 1 for an example). We do not consider a separate class of records that describe relationships between entities. Of course, some relationships can be represented in our model: for example, say Fred is Bill’s brother. Then the record for Fred may contain the value “brother: {Bill}”.
  • Consistent labels. We assume that the input data has gone through a schema-level integration phase, where incoming data is mapped to a common set of well-defined labels. For instance, we assume that a “salary” label means the same thing, no matter what the source of the information is. However, we do not impose a rigid structure on records: we allow missing or multiple values for each label.

The particular variant of the ER problem that we study in this paper may not be the most sophisticated, but is used frequently in practice, at least in the commercial world. Indeed, IBM’s recently introduced “DB2 Entity Analytic Solutions” [21] (formerly SRD) provides an exact, order-insensitive solution to the ER problem (applied to human identities), which abstracts away from the particular functions used to compare values. Another leading commercial offering from Fair Isaac Corp. also encapsulates the match process as pairwise Boolean functions [10]. The customer information management company uses a pairwise matching framework to which a combination of comparison algorithms can be applied. Although these products have extra features, the core of their approach is the same as ours. In fact, their commercial success originally motivated our study of this particular approach to entity resolution (see Section 6 for an overview of alternative techniques). In summary, the ER variant we address here is relatively simple, but as we will see, can still be very expensive to compute. One fundamental cause of this complexity in ER is that record merges can lead to new matches. To illustrate, consider the records of Figure 1. Suppose that our black-box record match function works as follows: The function compares the name, phone and email values of the two records. If the names are very similar (above some threshold), the records are said to match. The records also match if the phone and email are identical. For matching records, the black-box merge function combines the names into a “normalized” representative, and performs a set-union on the emails and phone numbers. Note that phone and e-mail are being treated as a unit for comparison purposes. We call such a unit a feature (defined formally in Section 4). Thus, in this example, there are two features: one is “name” and the other is the pair “phone + email”.
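A minimal sketch of such match and merge black-boxes follows, with records as dicts of value sets. The name-similarity test is a toy stand-in (the paper leaves it unspecified), and sharing a value is used as the natural generalization of “identical phone and email” once merged records carry multiple values:

    def similar_names(n1: str, n2: str) -> bool:
        # Toy stand-in for the paper's name-similarity black-box: assumes
        # two-token names, requires equal last names, and lets one first
        # name abbreviate the other ("J." for "John").
        (f1, l1), (f2, l2) = n1.lower().split(), n2.lower().split()
        if l1 != l2:
            return False
        f1, f2 = f1.rstrip('.'), f2.rstrip('.')
        return f1.startswith(f2) or f2.startswith(f1)

    def match(r1: dict, r2: dict) -> bool:
        # Feature 1: the names are similar.
        if any(similar_names(a, b) for a in r1['name'] for b in r2['name']):
            return True
        # Feature 2 ("phone + email"): the records share a phone AND an
        # email; for base records with single values this is equality.
        return bool(r1['phone'] & r2['phone']) and bool(r1['email'] & r2['email'])

    def merge(r1: dict, r2: dict) -> dict:
        # Toy "normalization": keep the longest name as representative;
        # phones and emails are set-unioned, as described in the text.
        rep = max(r1['name'] | r2['name'], key=len)
        return {'name': {rep},
                'phone': r1['phone'] | r2['phone'],
                'email': r1['email'] | r2['email']}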

Fig. 1 An instance of records representing persons

For our example, the black-box comparison function determines that r1 and r2 match, but r3 does not match either r1 or r2. For instance, the function finds that “John Doe” and “J. Doe” are similar, but finds “John D.” not similar to anything (e.g., because John is a frequent first name). Thus, r1 and r2 merge into a new record r4:

r4: {John Doe} {234-4358, 235-2635} {jdoe@yahoo}

Notice that r4 now matches r3, since the same phone and e-mail appear in both records. The combination of the information in r1 and r2 led us to discover a new match with r3, therefore yielding an initially unforeseen merge. Thus, every time two records are merged, the combined record needs to be re-compared with “everything else”.
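Running the sketch above on a hypothetical reconstruction of the Figure 1 records shows the effect: merging r1 and r2 exposes a match with r3 that neither source record had on its own:

    r1 = {'name': {'John Doe'}, 'phone': {'235-2635'}, 'email': {'jdoe@yahoo'}}
    r2 = {'name': {'J. Doe'},   'phone': {'234-4358'}, 'email': set()}
    r3 = {'name': {'John D.'},  'phone': {'234-4358'}, 'email': {'jdoe@yahoo'}}

    assert match(r1, r2)                            # similar names
    assert not match(r1, r3) and not match(r2, r3)  # no match yet

    r4 = merge(r1, r2)    # {John Doe} {234-4358, 235-2635} {jdoe@yahoo}
    assert match(r4, r3)  # shared phone + email: a brand-new match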

Because record matching is inherently expensive, large sets of input records are often divided into “buckets” using application knowledge, and then ER is run on each bucket. For instance, if we are resolving products, we may be able to divide them using a “category” field. Thus, camera records will only be matched against other cameras, CDs will only be matched against other CDs, and so on. If a record may match records in more than one category, then typically copies of the record are placed in multiple buckets. For example, a cell phone with a camera may be placed in the camera and the telephone buckets. (In our related work section we briefly mention other ways in which domain knowledge can be used to prune the search space.) In this paper we focus on resolving records within one bucket, that is, we study algorithms that must exhaustively consider all (within bucket) possible record matches. This type of exhaustive algorithm is invoked by a higher-level process that divides the data and decides what buckets need to be resolved. And since buckets can be quite large, it is still important to have as efficient an algorithm as possible for exhaustive ER. (Note that if the semantic function that divides records is imprecise, then overall matches may be missed, e.g., two wet-suits may be incorrectly placed in different buckets, say clothing and sporting goods. In this paper we do not consider the accuracy of the semantic function that partitions records.)
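A sketch of this higher-level bucketing step, assuming a hypothetical categories_of function that may assign a record to several buckets, and some within-bucket ER routine er (such as the R-Swoosh sketch later on this page):

    from collections import defaultdict

    def bucketize(records, categories_of):
        # categories_of(r) returns the buckets a record may belong to,
        # e.g. {'camera', 'telephone'} for a camera phone; such records
        # are copied into every one of those buckets.
        buckets = defaultdict(list)
        for r in records:
            for cat in categories_of(r):
                buckets[cat].append(r)
        return buckets

    def resolve_by_bucket(records, categories_of, er):
        # Run exhaustive ER independently within each bucket.
        return {cat: er(bucket)
                for cat, bucket in bucketize(records, categories_of).items()}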

In summary, in this paper we make the following contributions:

  • We formalize the generic ER problem (Section 2). Unlike other works that focus only on identifying matching records (see related work in Section 6), we also include the process of merging records and how it may lead to new matches.

  • We identify the ICAR properties (see Section 2.2) of match and merge functions that lead to efficient strategies.
  • We present ER algorithms for three scenarios:
    • G-Swoosh: The most general ER algorithm, for the case where the 4 properties of match and merge functions do not hold (Section 3.1).
    • R-Swoosh: An algorithm that exploits the 4 properties of match and merge functions, and that performs comparisons at the granularity of records (Section 3.2); a minimal sketch appears after this list.
    • F-Swoosh: An algorithm that also exploits the 4 properties, and uses feature-level comparison functions (Section 4.1). F-Swoosh avoids repeated feature comparisons and can be significantly more efficient than R-Swoosh.

For each algorithm, we show that it computes the correct ER result and that it is “optimal” in terms of the number of comparisons performed. (What we mean by “optimal” varies by scenario and is precisely defined in each section.)

  • We experimentally evaluate the algorithms using actual comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. Our results show that G-Swoosh can only be used on relatively small data sets when merges occur frequently, while R-Swoosh and F-Swoosh can handle substantially more data. Furthermore, when we know the features used for comparisons, we can use F-Swoosh and achieve a performance improvement of between 1.1× and 11.4×.
  • Since G-Swoosh is so expensive, we investigate using R-Swoosh even when the ICAR properties of match and merge functions do not hold. In this case R-Swoosh does not produce the correct answer, but we show that what R-Swoosh produces is close to what G-Swoosh produces. Thus, if the application can tolerate an approximate answer, R-Swoosh and F-Swoosh are viable algorithms for all scenarios.
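As promised above, a minimal sketch of R-Swoosh, following the Section 3.2 description (plain Python lists rather than the paper’s exact data structures; correctness assumes the ICAR properties hold):

    def r_swoosh(records, match, merge):
        # R holds records not yet compared; Rp holds records known to be
        # pairwise non-matching.  Every merge result goes back into R, so
        # matches newly exposed by a merge are not missed.
        R, Rp = list(records), []
        while R:
            r = R.pop()
            partner = next((rp for rp in Rp if match(r, rp)), None)
            if partner is None:
                Rp.append(r)            # r matches nothing resolved so far
            else:
                Rp.remove(partner)      # the merge replaces both sources
                R.append(merge(r, partner))
        return Rp

On the Figure 1 records this returns the single fully merged record regardless of the order in which records are drawn, which is precisely what the ICAR properties guarantee.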

Related Work

Originally introduced by Newcombe et al. [29] as “record linkage”, entity resolution was then studied under various names, such as merge/purge [19], deduplication [31], reference reconciliation [14], object identification [36], and others.

Several works have addressed the issue of performance for ER algorithms. However, most of them make strong assumptions on the data and/or the comparison functions to make their algorithms efficient. For example, references [19, 20] assume that records can be represented by one or multiple alphanumeric keys, and that most matches occur between records whose keys are lexicographically close. A “blocking key” can be used to split records into buckets [22] or canopies [25].

Reference [23] proposed mapping the record values into a multi-dimensional Euclidean space, then performing a similarity join. An overview of such “blocking” methods can be found in [4]. Since they do not compare all records, such techniques make ER algorithms approximate, to an extent that depends on properties of the data. More recently, reference [2] proposed efficient algorithms for set-similarity joins using string similarity functions. In contrast, we view the match and merge functions as black-boxes and provide exact ER algorithms that are optimal in the number of black-box invocations.

Reference [27] proposes a generic and exact approach, where the ER problem is viewed as an instance of the classical set-union problem [35], for which efficient data structures and algorithms were extensively studied. However, their work requires the record comparison function to be transitive, a property we believe is constraining and difficult to satisfy.
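For contrast, a sketch of that set-union view: when the match function is transitive, ER reduces to grouping records into connected components with a standard union-find structure (illustrative code, not from either paper). Note that it only groups records and has no notion of a custom merge:

    def transitive_er(records, match):
        # ER via union-find; valid only if `match` is transitive.
        parent = list(range(len(records)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path compression
                i = parent[i]
            return i

        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                if match(records[i], records[j]):
                    parent[find(i)] = find(j)   # union the two groups

        groups = {}
        for i in range(len(records)):
            groups.setdefault(find(i), []).append(records[i])
        return list(groups.values())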

Iterative approaches [8,14] identified the need to transitively compare merged records to discover more matches, for merges that are simple groupings of the data in merged records. Our approach allows richer, “custom” merges. More importantly, it eliminates redundant comparisons by tightly integrating merges and comparisons, and naturally yields incremental algorithms.

Our match functions are specified as logical formulas of smaller feature-level match functions, in the style of the “equational theory” of [19], and similar in spirit to works on declarative data cleaning (e.g., [16]). Such a specification has the benefits of (1) having clear semantics, and (2) allowing the kinds of optimizations we perform.

While our work focuses on performance, there has also been a significant amount of work on enhancing the precision and recall of the ER process. The first formalization, by Fellegi and Sunter [15], optimizes the relative importance of numerical similarity functions between records, in a probabilistic setting. In this paper and most follow-ups (see [38, 18] for recent surveys), the assessment of ER is in terms of precision and recall of the obtained classification. Many string comparison techniques based on edit-distances [34], TF-IDF [13], or adaptive techniques such as q-grams [11, 17] are used for matching records. Reference [28] removes attribute-level conflicts of matching records by comparing the quality of their data sources. Reference [32] provides user-defined grouping as part of an SQL extension. As domain-independent techniques may not be suitable for some domains, one may need domain-specific value comparison functions [1]. Any of these techniques can fill in the black-boxes, which we decouple from our match and merge process.

Finally, there has also been a great amount of research on non-pairwise ER, including clustering techniques [27,3,12], classifiers such as Bayesian networks [37], decision trees, SVMs, or conditional random fields [33]. The parameters of these models are learned either from a (hopefully representative) set of labeled examples, possibly with the help of a user [31], or in an unsupervised way [39,12].

A recent line of work focuses on the relationships among records [14,30,24,5]. Reference [9] proposed a technique to resolve entities collectively based on the relationship graph among records. Such techniques are not pairwise because they generally examine all or part of the dataset to learn match decisions. In contrast, our focus is on pairwise ER because of its practical value, such as easier coding and efficiency. As we mentioned in the introduction, however, we can use non-pairwise techniques during a prior training phase.

Conclusion

Entity Resolution (ER) is an important information integration problem arising when data from diverse sources is combined. We have divided the problem into two aspects: the black-box functions that match and merge records, and the ER algorithm that invokes these functions. In our opinion, this division has two important advantages: (i) it yields generic ER algorithms that can be used, with well-defined semantics, in many applications, and (ii) it lets us focus on our performance measure, the number of black-box invocations. While there may be other factors that affect the overall runtime performance, we assume that the black-box invocations are potentially expensive and thus are the critical runtime factors. We have presented three algorithms, G-Swoosh, R-Swoosh, and F-Swoosh, that make as few calls as possible, thus yielding significant performance improvements over naive algorithms. We have also presented four important yet natural ICAR properties for match and merge functions, that lead to the significantly more efficient R-Swoosh and F-Swoosh. We have argued that these properties should guide the development of merge and match functions. If the properties do not hold because of application semantics, the designers will know that ER will be inherently expensive.
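Though the four ICAR properties are not spelled out on this page, a designer can spot-check them on sample data before trusting R-Swoosh or F-Swoosh. A minimal sketch, with the property statements paraphrased from the paper (a passing test is necessary evidence, not a proof):

    import itertools

    def check_icar(match, merge, sample):
        for r in sample:                                  # Idempotence
            assert match(r, r) and merge(r, r) == r
        for r1, r2 in itertools.permutations(sample, 2):  # Commutativity
            assert match(r1, r2) == match(r2, r1)
            if match(r1, r2):
                assert merge(r1, r2) == merge(r2, r1)
        for r1, r2, r3 in itertools.permutations(sample, 3):
            if match(r1, r2) and match(r1, r3):           # Representativity
                assert match(merge(r1, r2), r3)
            if (match(r1, r2) and match(r2, r3)           # Associativity
                    and match(merge(r1, r2), r3)
                    and match(r1, merge(r2, r3))):
                assert merge(merge(r1, r2), r3) == merge(r1, merge(r2, r3))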

References

  • 1. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of VLDB, pp. 586–597 (2002)
  • 2. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB, pp. 918–929 (2006)
  • 3. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: FOCS, pp. 238– (2002)
  • 4. Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking methods for record linkage. In: Proceedings of ACM SIGKDD’03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003). URL citeseer.ist.psu.edu/article/baxter03comparison.html
  • 5. Bekkerman, R., McCallum, A.: Disambiguating web appearances of people in a social network. In: WWW, pp. 463–470 (2005)
  • 6. Benjelloun, O., Garcia-Molina, H., Jonas, J., Menestrina, D., Whang, S., Su, Q., Widom, J.: Swoosh: a generic approach to entity resolution. Tech. rep., Stanford University (2006). Available at http://dbpubs.stanford.edu/pub/2005-5

  • 7. Benjelloun, O., Garcia-Molina, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution. In: ICDCS (2007)
  • 8. Bhattacharya, I., Getoor, L.: Iterative record linkage for cleaning and integration. In: Proceedings of KDD-2004 (2004)
  • 9. Bhattacharya, I., Getoor, L.: A latent dirichlet model for unsupervised entity resolution. In: 6th SIAM Conference on Data Mining (2006)
  • 10. Blume, M.: Automatic Entity Disambiguation: Benefits to NER, Relation Extraction, Link Analysis, and Inference. In: Proceedings of the International Conference on Intelligence Analysis (2005). Available from: https://analysis.mitre.org/
  • 11. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: Proceedings of ACM SIGMOD, pp. 313–324 (2003)
  • 12. Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: Proceedings of ICDE. Tokyo, Japan (2005)
  • 13. Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems 18, 288–321 (2000)
  • 14. Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces. In: Proceedings of ACM SIGMOD (2005)
  • 15. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of the American Statistical Association 64(328), 1183–1210 (1969)
  • 16. Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.A.: Declarative data cleaning: Language, model, and algorithms. In: Proceedings of VLDB, pp. 371–380 (2001)
  • 17. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
  • 18. Gu, L., Baxter, R., Vickers, D., Rainsford, C.: Record linkage: Current practice and future directions. Tech. Rep. 03/83, CSIRO Mathematical and Information Sciences (2003)
  • 19. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Proceedings of ACM SIGMOD, pp. 127–138 (1995)
  • 20. Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov. 2(1), 9–37 (1998)
  • 21. IBM: DB2 Entity Analytic Solutions. http://www-306.ibm.com/software/data/db2/eas/
  • 22. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84(406), 414–420 (1989)
  • 23. Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Proceedings of Intl. Conference on Database Systems for Advanced Applications, pp. 137– (2003)
  • 24. Kalashnikov, D.V., Mehrotra, S., Chen, Z.: Exploiting relationships for domain-independent data cleaning. In: Proceedings of the SIAM International Conference on Data Mining (2005)
  • 25. McCallum, A., Nigam, K., Ungar, L.H.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of KDD, pp. 169–178. Boston, MA (2000)
  • 26. Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: CleanDB (2006)
  • 27. Monge, A.E., Elkan, C.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 23–29 (1997)
  • 28. Motro, A., Anokhin, P.: Fusionplex: resolution of data inconsistencies in the integration of heterogeneous information sources. Information Fusion 7(2), 176–196 (2006)
  • 29. Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
  • 30. Parag, Domingos, P.: Multi-relational record linkage. In: Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining, pp. 31–48 (2004)
  • 31. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of ACM SIGKDD. Edmonton, Alberta (2002)
  • 32. Schallehn, E., Sattler, K.U., Saake, G.: Extensible and similarity-based grouping for data integration. In: ICDE, p. 277 (2002)
  • 33. Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: Proceedings of PKDD, pp. 297–308 (2005)
  • 34. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981)
  • 35. Tarjan, R.E.: Efficiency of a good but not linear set union algorithm. J. ACM 22(2), 215–225 (1975)
  • 36. Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Information Systems Journal 26(8), 635–656 (2001)

  • 37. Verykios, V.S., Moustakides, G.V., Elfeky, M.G.: A Bayesian decision model for cost optimal record matching. The VLDB Journal 12(1), 28–40 (2003). URL http://www.cs.purdue.edu/homes/mgelfeky/Papers/vldbj12(1).pdf

  • 38. Winkler, W.: Overview of record linkage and current research directions. Tech. rep., Statistical Research Division, U.S. Bureau of the Census, Washington, DC (2006)

  • 39. Winkler, W.E.: Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. American Statistical Association, Proceedings of the Section on Survey Research Methods, pp. 667–671 (1988)


Author: Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, Jennifer Widom
Title: Swoosh: A Generic Approach to Entity Resolution
Journal: VLDB Journal
Year: 2008
DOI: 10.1007/s00778-008-0098-x
URL: http://dx.doi.org/10.1007/s00778-008-0098-x