2010 FrameworksforEntityMatchingACom


Subject Headings: Record Deduplication System.


Cited By


Author Keywords

Entity resolution; Entity matching; Matcher combination; Match optimization; Training selection


Entity matching is a crucial and difficult task for data integration. Entity matching frameworks provide several methods and their combination to effectively solve different match tasks. In this paper, we comparatively analyze 11 proposed frameworks for entity matching. Our study considers both frameworks which do or do not utilize training data to semi-automatically find an entity matching strategy to solve a given match task. Moreover, we consider support for blocking and the combination of different match algorithms. We further study how the different frameworks have been evaluated. The study aims at exploring the current state of the art in research prototypes of entity matching frameworks and their evaluations. The proposed criteria should be helpful to identify promising framework approaches and enable categorizing and comparatively assessing additional entity matching frameworks and their evaluations.

1. Introduction

Entity matching (also referred to as duplicate identification, record linkage, entity resolution or reference reconciliation) is a crucial task for data integration and data cleaning [19,33,47]. It is the task of identifying entities (objects, data instances) referring to the same real-world entity. Entities to be resolved may reside in distributed, typically heterogeneous data sources or in a single data source, e.g., in a database or a search engine store. They may be physically materialized or dynamically be requested from sources, e.g., by database queries or keyword searches.

Entity matching is a challenging task particularly for entities that are highly heterogeneous and of limited data quality, e.g., regarding completeness and consistency of their descriptions. Table 1 illustrates some of the problems for a bibliographic example of three duplicate entries for the same paper. It is assumed that the bibliographic entities have automatically been extracted from fulltext documents and may thus contain numerous quality problems such as misspelled author names, different ordering of authors, and heterogeneous venue denominations.
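The quality problems named above (misspelled author names, different author ordering) can be illustrated with a minimal sketch. It is not taken from any of the surveyed frameworks: the records, the helper names `normalize_authors`, `similarity` and `is_match`, the unweighted average, and the 0.8 threshold are all assumptions chosen for illustration, and the records are hypothetical rather than the entries of Table 1.

```python
from difflib import SequenceMatcher


def normalize_authors(authors: str) -> str:
    """Lower-case the names and sort them so that author order does not matter."""
    names = [name.strip().lower() for name in authors.split(",") if name.strip()]
    return ", ".join(sorted(names))


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1], using the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Combine two attribute matchers (title, authors) by an unweighted average.

    The threshold of 0.8 is an illustrative assumption, not a recommended value.
    """
    title_sim = similarity(rec_a["title"].lower(), rec_b["title"].lower())
    author_sim = similarity(
        normalize_authors(rec_a["authors"]),
        normalize_authors(rec_b["authors"]),
    )
    return (title_sim + author_sim) / 2 >= threshold


# Hypothetical bibliographic records with typical extraction errors.
rec_1 = {
    "title": "Frameworks for Entity Matching: A Comparison",
    "authors": "Hanna Köpcke, Erhard Rahm",
}
rec_2 = {  # misspelled title, swapped author order, lost umlaut
    "title": "Frameworks for entity matching: a comparision",
    "authors": "Erhard Rahm, Hanna Kopcke",
}
rec_3 = {  # made-up, unrelated record
    "title": "Query Optimization in Distributed Databases",
    "authors": "Jane Doe, John Smith",
}

print(is_match(rec_1, rec_2))
print(is_match(rec_1, rec_3))
```

With these inputs the first pair is classified as a duplicate and the second is not. A real match task would typically also need blocking to avoid comparing every pair of records, and a better-founded way to choose similarity functions and thresholds, which are the aspects the frameworks are compared on.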

To investigate the state of the art in this respect we comparatively analyze 11 proposed entity matching frameworks. We focus on research prototypes but do not consider the more general system approaches on data cleaning and data integration, such as AJAX [27] and Potter’s Wheel [48]. We also exclude commercial systems such as ChoiceMaker, DataCleanser (EDD), Merge/Purge Library (Sagent/QM Software) or MasterMerge (Pitney Bowes) from our discussion since they are not widely available and their algorithms are not described in the public literature.



Author: Erhard Rahm, Hanna Köpcke
Title: Frameworks for Entity Matching: A Comparison
DOI: 10.1016/j.datak.2009.10.003
Year: 2010