Record Linkage Task

A Record Linkage Task is a coreference resolution task that requires clustering together the entity records that share the same referent.
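
As a minimal illustration of this framing (the records, the match rule, and the ZIP-code field below are all simplifying assumptions, not a standard algorithm), records can be compared pairwise and then grouped with a union-find structure so that each resulting cluster stands for one referent:

```python
from itertools import combinations

def same_referent(a: dict, b: dict) -> bool:
    # Toy match rule (an illustrative assumption): two records co-refer
    # when their normalized names and ZIP codes agree.
    norm = lambda s: s.lower().replace(".", "").strip()
    return norm(a["name"]) == norm(b["name"]) and a["zip"] == b["zip"]

def link_records(records: list[dict]) -> list[list[dict]]:
    # Union-find over record indices: each final cluster is one referent.
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_referent(records[i], records[j]):
            parent[find(i)] = find(j)  # union the two clusters

    clusters: dict[int, list[dict]] = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())

records = [
    {"name": "John A. Smith", "zip": "55401"},
    {"name": "john a smith",  "zip": "55401"},  # same referent, different shape
    {"name": "J. Gonzalez",   "zip": "90012"},
]
for cluster in link_records(records):
    print([r["name"] for r in cluster])
# [['John A. Smith', 'john a smith'], ['J. Gonzalez']]
```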



References

2012

  • (Wikipedia, 2012) ⇒ http://en.wikipedia.org/wiki/Record_linkage
    • QUOTE: Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. In mathematical graph theory, record linkage can be seen as a technique of resolving bipartite graphs.
  • http://en.wikipedia.org/wiki/Record_linkage#Naming_conventions
    • QUOTE: "Record linkage" is the term used by statisticians, epidemiologists, and historians, among others, to describe the process of joining records from one data source with another that describe the same entity. Commercial mail and database applications refer to it as "merge/purge processing" or "list washing". Computer scientists often refer to it as "data matching" or as the "object identity problem". Other names used to describe the same concept include "entity resolution", “identity resolution", "entity disambiguation", "duplicate detection", "record matching", "instance identification", "deduplication", “coreference resolution", "reference reconciliation", "data alignment", and "database hardening". This profusion of terminology has led to few cross-references between these research communities.

2008

  • (Benjelloun et al., 2008) ⇒ Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. (2008). “Swoosh: A generic approach to entity resolution." In: The VLDB Journal.
    • QUOTE: Entity Resolution (ER) (sometimes referred to as deduplication) is the process of identifying and merging records judged to represent the same real-world entity. ER is a well-known problem that arises in many applications. For example, mailing lists may contain multiple entries representing the same physical address, but each record may be slightly different, e.g., containing different spellings or missing some information. As a second example, consider a company that has different customer databases (e.g., one for each subsidiary), and would like to consolidate them. Identifying matching records is challenging because there are no unique identifiers across databases. A given customer may appear in different ways in each database, and there is a fair amount of guesswork in determining which customers match.
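
A rough sketch of the identify-and-merge loop described in this quote (the match rule and merge policy below are illustrative assumptions, not the Swoosh algorithm itself): any two records judged to match are merged into one, and the process repeats until no matches remain.

```python
def match(r1: dict, r2: dict) -> bool:
    # Illustrative rule: records match if they agree on an identifying
    # field such as email or phone (ignoring empty values).
    keys = ("email", "phone")
    return any(r1.get(k) and r1.get(k) == r2.get(k) for k in keys)

def merge(r1: dict, r2: dict) -> dict:
    # Union the information of both records; keep non-empty values.
    out = dict(r1)
    for k, v in r2.items():
        if v and not out.get(k):
            out[k] = v
    return out

def resolve(records: list[dict]) -> list[dict]:
    # Merge matching pairs until no more matches exist (a fixpoint),
    # mirroring the identify-and-merge loop described in the quote.
    records = list(records)
    changed = True
    while changed:
        changed = False
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                if match(records[i], records[j]):
                    merged = merge(records[i], records[j])
                    del records[j], records[i]  # remove the pair...
                    records.append(merged)      # ...and add their merge
                    changed = True
                    break
            if changed:
                break
    return records

print(resolve([
    {"name": "J. Smith",   "email": "js@example.com", "phone": ""},
    {"name": "John Smith", "email": "js@example.com", "phone": "555-0100"},
    {"name": "Jane Doe",   "email": "jd@example.com", "phone": ""},
]))
```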

2007b

  • (Bhattacharya & Getoor, 2007) ⇒ Indrajit Bhattacharya, and Lise Getoor. (2007). “Collective entity resolution in relational data.” In: ACM Transactions on Knowledge Discovery from Data (TKDD).
    • QUOTE: Entity resolution is a common problem that comes in different guises (and is given different names) in many computer science domains. Examples include computer vision, where we need to figure out when regions in two different images refer to the same underlying object (the correspondence problem); natural language processing, when we would like to determine which noun phrases refer to the same underlying entity (coreference resolution); and databases, where, when merging two databases or cleaning a database, we would like to determine when two tuple records are referring to the same real-world object (deduplication and data integration). Deduplication [Hernández and Stolfo 1995; Monge and Elkan 1996] is important for both accurate analysis, for example, determining the number of customers, and for cost-effectiveness, for example, removing duplicates from mailing lists. In information integration, determining approximate joins [Cohen 2000] is important for consolidating information from multiple sources; most often there will not be a unique key that can be used to join tables in distributed databases, and we must infer when two records from different databases, possibly with different structures, refer to the same entity. In many of these examples, co-occurrence information in the input can be naturally represented as a graph.
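
As a small illustration of the closing point about co-occurrence graphs (the graph, the Jaccard measure, and the 0.5 threshold are illustrative assumptions, not the authors' collective algorithm): two same-name references can be judged co-referent when their neighborhoods in the co-occurrence graph overlap strongly.

```python
# Co-occurrence graph: each reference maps to the references it appears with
# (e.g., co-author lists). "W. Wang" occurs three times; which are the same person?
graph = {
    "W. Wang #1": {"A. Ansari", "C. Chen"},
    "W. Wang #2": {"A. Ansari", "C. Chen", "D. Das"},
    "W. Wang #3": {"L. Li", "X. Xu"},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Relational evidence: merge same-name references whose co-occurring
# neighbors overlap enough (the threshold is an illustrative choice).
refs = list(graph)
for i in range(len(refs)):
    for j in range(i + 1, len(refs)):
        sim = jaccard(graph[refs[i]], graph[refs[j]])
        verdict = "same entity" if sim >= 0.5 else "different entities"
        print(f"{refs[i]} vs {refs[j]}: {sim:.2f} -> {verdict}")
```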

2006

  • (Winkler, 2006) ⇒ William E. Winkler. (2006). “Overview of record linkage and current research directions." Technical Report Statistical Research Report Series RRS2006/02, U.S. Bureau of the Census.
    • QUOTE: Record linkage is the means of combining information from a variety of computerized files. It is also referred to as data cleaning (McCallum and Wellner 2003) or object identification (Tejada et al. 2002).

      If a number of files are combined into a data warehouse, then Fayyad and Uthurusamy (1996, 2002) and Fayyad et al. (1996) have stated that the majority (possibly above 90%) of the work is associated with cleaning up the duplicates. Winkler (1995) has shown that computerized record linkage procedures can significantly reduce the resources needed for identifying duplicates in comparison with methods that are primarily manual. Newcombe and Smith (1975) have demonstrated that purely computerized duplicate detection in high-quality person lists can often identify duplicates at a greater level of accuracy than duplicate detection that involves a combination of computerized procedures and review by highly trained clerks. The reason is that the computerized procedures can make use of overall information from large parts of a list. For instance, the purely computerized procedure can make use of the relative rarity of various names and combinations of information in identifying duplicates. The relative rarity is computed as the files are being matched. Winkler (1995, 1999a) observed that the automated frequency-based (or value-specific) procedures could account for the relative rarity of a name such as 'Martinez' in cities such as Minneapolis, Minnesota in the US in comparison with the relatively high frequency of 'Martinez' in Los Angeles, California.

      Record linkage of files (Fellegi and Sunter 1969) is used to identify duplicates when unique identifiers are unavailable. It relies primarily on matching of names, addresses, and other fields that are typically not unique identifiers of entities. Matching businesses using business names and other information can be particularly difficult (Winkler 1995). Record linkage is also called object identification (Tejada et al. 2001, 2002), data cleaning (Do and Rahm 2000), approximate matching or approximate joins (Gravano et al. 2001, Guha et al. 2004), fuzzy matching (Ananthakrishna et al. 2002), and entity resolution (Benjelloun et al. 2005).

  • Nick Koudas, editor. (2006). “Issue on Data Quality." IEEE Data Engineering Bulletin, volume 29.
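
The value-specific (frequency-based) weighting that Winkler describes above can be sketched in the Fellegi-Sunter style (the m-probability of 0.95 and the toy frequency tables are illustrative assumptions): agreement on a surname that is rare in a file earns a larger log-odds match weight than agreement on a locally common one.

```python
import math
from collections import Counter

def value_specific_weight(value: str, file_values: list[str], m: float = 0.95) -> float:
    # Fellegi-Sunter agreement weight log2(m/u), with u made value-specific:
    # u is approximated by the relative frequency of the value in the file,
    # i.e., the chance a non-matched pair agrees on it by coincidence.
    freq = Counter(file_values)
    u = freq[value] / len(file_values)
    return math.log2(m / u)

# Toy surname frequencies for two cities (illustrative numbers only).
minneapolis = ["Johnson"] * 60 + ["Olson"] * 30 + ["Martinez"] * 2
los_angeles = ["Martinez"] * 40 + ["Johnson"] * 30 + ["Kim"] * 22

# Agreement on 'Martinez' is strong evidence in Minneapolis but weaker in
# Los Angeles, because the weight reflects how rare the agreeing value is
# locally, computed from the file itself as matching proceeds.
print(value_specific_weight("Martinez", minneapolis))  # large weight (rare)
print(value_specific_weight("Martinez", los_angeles))  # smaller weight (common)
```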

1962

  • (Newcombe and Kennedy, 1962) ⇒ H. B. Newcombe, and J. M. Kennedy. (1962). “Record linkage: making maximum use of the discriminating power of identifying information." In: Communications of the ACM, 5(11).

1946

  • Halbert L. Dunn. (1946). “Record Linkage." American Journal of Public Health 36 (12).

  1. Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, Berlin/New York.
  2. Fellegi, I. P., and Sunter, A. B. (1969). A Theory for Record Linkage. Journal of the American Statistical Association, 64(328):1183–1210.
  3. Herzog, T. N., Scheuren, F. J., and Winkler, W. E. (2007). Data Quality and Record Linkage Techniques. Springer, New York/London.