Pairwise Entity Record Deduplication Algorithm

From GM-RKB
Jump to navigation Jump to search

A Pairwise Entity Record Deduplication Algorithm is an Entity Record Deduplication Algorithm that uses a Binary Classification Algorithm to determine whether two Entity Records are in a Reference Relation.



References

  • (BenjellounGSW, 2008) ⇒ Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. (2008). “Swoosh: A generic approach to entity resolution.” In: VLDB Journal, (2008).
    • In this paper we take a “generic approach” for solving ER, i.e., we do not study the internal details of the functions used to compare and merge records. Rather, we view these functions as “black-boxes” to be invoked by the ER engine. (Incidentally, there has been a lot of work done on the design of effective comparison and merge functions; see Section 6.) Given such black-boxes, we study algorithms for efficiently performing ER, i.e., we develop strategies that minimize the number of invocations to these potentially expensive blackboxes. In a way, our work is analogous to the design of efficient join algorithms, except that the operator we study is the ER operator. An important component of our work is that we identify a set of properties that, if satisfied by the match and merge functions, lead to significantly more efficient ER. For example, associativity of merges is one such important property: If merges are not associative, the order in which records are merged may impact the final result. Another notable feature is that we do not perform the matching and merging separately, but tightly integrate them into a single process.
    • In this paper we focus on “pairwise ER,” a common way to resolve records in the commercial world. In particular, the following assumptions are made:
    • Pairwise decisions. Our black-box functions to match and merge records operate on two records at a time. Their operation depends solely on the data in these records, and not on the evidence in other records. In general, it is easier for application specialists to write pairwise record comparison and merge functions, as opposed to, say, functions that determine when a group of records may represent the same entity. Note that this requirement needs only be true at the time ER is performed, and does not preclude a prior training phase that considers the whole dataset, or a representative sample. (For example, a first phase can compute term frequencies for say all product descriptions, and the frequencies can then be used in comparing pairs of descriptions.) Thus, approaches based on machine learning can be leveraged to match or merge records.