Record Merging Task


A Record Merging Task is a merging task that requires the conversion of a set of data records into a single canonical data record.



  • World Wide Web Consortium (W3C) on XML Canonicalization
    • Canonical XML [XML-C14N] specifies a standard serialization of XML that, when applied to a subdocument, includes the subdocument's ancestor context including all of the namespace declarations and attributes in the "xml:" namespace. However, some applications require a method which, to the extent practical, excludes ancestor context from a canonicalized subdocument. For example, one might require a digital signature over an XML payload (subdocument) in an XML message that will not break when that subdocument is removed from its original message and/or inserted into a different context. This requirement is satisfied by Exclusive XML Canonicalization.
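
As a rough illustration of why a canonical serialization matters for signatures, the sketch below uses `xml.etree.ElementTree.canonicalize` from the Python standard library (Python 3.8+). Note the assumption: that function implements C14N 2.0, not the Exclusive XML Canonicalization discussed above, so it only demonstrates the general idea that equivalent documents serialize to identical text. The sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same logical element (hypothetical sample data):
# different quote style, attribute order, whitespace, and empty-element syntax.
variant_a = "<payload b='2' a='1'/>"
variant_b = '<payload   a="1"   b="2"></payload>'

# canonicalize() returns the canonical form as a string when no output
# file is given.
canonical_a = ET.canonicalize(variant_a)
canonical_b = ET.canonicalize(variant_b)

# The raw strings differ, but their canonical forms are identical, so a
# digest computed over the canonical form would be stable across both.
same_after_canonicalization = canonical_a == canonical_b
```

A digest or signature would then be computed over the canonical string rather than over the raw input.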


  • (Wick et al., 2009) ⇒ Michael Wick, Aron Culotta, Khashayar Rohanimanesh, and Andrew McCallum. (2009). “An Entity Based Model for Coreference Resolution.” In: Proceedings of the SIAM International Conference on Data Mining (SDM 2009).
    • QUOTE: In this section, we formalize the canonicalization problem and present a solution based on string edit-distance. Given a collection of citation mentions (or newswire documents annotated with a set of entity mentions) m = {m_1 ... m_n}, coreference resolution is the problem of clustering m into sets of mentions that all refer to the same underlying object (e.g., research paper in the citation or ACE entity in the newswire case). Let m^j = {m_i ... m_k} be a set of coreferent mentions, where each mention has a set of attribute-value pairs {<a_1, v_1> ... <a_p, v_p>}. Canonicalization is the task of constructing a representative set of attributes for m^j.

      Often, canonicalization is performed upon placing an entity into a relational database, either for further processing or browsing by a user. Therefore, canonicalization should create a set of attributes that are both complete and accurate. Efficiency is another motivation for canonicalization — it may be infeasible to store and reason about all mentions to each entity in the database.
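
One simple way to read an edit-distance-based approach is to pick, for each attribute, the observed value with the smallest total edit distance to all other observed values (a "medoid" string). The sketch below is only that reading, not necessarily the authors' method. The function names and sample mentions are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def medoid_value(values: list[str]) -> str:
    """Return the value with the smallest summed distance to all values."""
    return min(values, key=lambda v: sum(levenshtein(v, w) for w in values))


# Hypothetical values of one attribute across coreferent mentions.
author_values = ["John Smith", "John Smith", "J. Smith", "John Smiht"]
canonical_author = medoid_value(author_values)
```

Applying `medoid_value` to each attribute in turn yields one representative attribute set for the cluster. Ties are broken by input order, which a real system would likely want to handle more deliberately.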


  • (Culotta et al., 2007) ⇒ Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, and Andrew McCallum. (2007). “Canonicalization of Database Records using Adaptive Similarity Measures.” In: Proceedings of KDD-2007.
    • QUOTE: Consider a research publication database such as Citeseer or Rexa that contains records gathered from a variety of sources using automated extraction techniques. Because the data comes from multiple sources, it is inevitable that an attribute such as a conference name will be referenced in multiple ways. Since the data is also the result of extraction, it may also contain errors. In the presence of this noise and variability, the system must generate a single, canonical record to display to the user.

      Record canonicalization is the problem of constructing one standard record representation from a set of duplicate records. In many databases, canonicalization is enforced with a set of rules that place limitations or guidelines for data entry. However, obeying these constraints is often tedious and error-prone. Additionally, such rules are not applicable when the database contains records extracted automatically from unstructured sources.

      Simple solutions to the canonicalization problem are often insufficient. For example, one can simply return the most common string for each field value. However, incomplete records are often more common than complete records. For instance, this approach may canonicalize a record as “J. Smith” when in fact the full name (John Smith) is much more desirable.

      In addition to being robust to noise, the system must also be able to adapt to user preferences. For example, some users may prefer abbreviated forms (e.g., KDD) instead of expanded forms (e.g., Conference on Knowledge Discovery and Data Mining). The system must be able to detect and react to such preferences.
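
The "most common string" baseline mentioned in the quote can be sketched as a per-field majority vote. The example below is a minimal illustration with made-up duplicate records. It reproduces the weakness the quote describes, where the abbreviated "J. Smith" wins simply because it is more frequent.

```python
from collections import Counter


def most_common_canonical(records: list[dict]) -> dict:
    """Build one record by taking the most frequent non-empty value per field."""
    fields = sorted({field for record in records for field in record})
    canonical = {}
    for field in fields:
        values = [r[field] for r in records if r.get(field)]
        if values:
            # most_common(1) returns [(value, count)] for the top value.
            canonical[field] = Counter(values).most_common(1)[0][0]
    return canonical


# Hypothetical duplicate records for a single publication.
duplicates = [
    {"author": "J. Smith", "venue": "KDD"},
    {"author": "J. Smith", "venue": "KDD"},
    {"author": "John Smith",
     "venue": "Conference on Knowledge Discovery and Data Mining"},
]
canonical_record = most_common_canonical(duplicates)
```

Here the vote returns "J. Smith" and "KDD", even though a user might prefer the full author name. That gap is the motivation the quote gives for more adaptive approaches.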


  • (Paskin et al., 2003) ⇒ Norman Paskin, Eamonn Neylon, Tony Hammond, and Sam Sun. (2003). “The "doi" URI Scheme for the Digital Object Identifier (DOI).” Internet-Draft: draft-paskin-doi-uri-04.txt
    • In order to facilitate comparison of "doi" URIs and to reduce the risk of false negatives, normalization to the canonical form should be applied to minimize the amount of software processing for such comparisons.
    • The following normalization steps should be applied:
      • 1. Normalize the case of the leading "doi:" token to be lowercase
      • 2. Unescape all unreserved %-escaped characters
      • 3. Normalize the case of the scheme-specific part including any %-escaped characters to be uppercase
    • The following forms of a "doi" URI
      • 1. DOI:dk/P%C3%A6dagogi%2037(2),%20562
      • 2. doi:DK/P%C3%A6dagogi%2037(2),%20562
      • 3. doi:dk/P%c3%a6dagogi%2037(2),%20562
      • 4. doi:dk/p%c3%a6dagogi%2037(2),%20562
      • 5. doi:dk%2FP%C3%A6dagogi%2037%282%29%2C%20562
    • are normalized to the canonical form
      • doi:DK/P%C3%A6DAGOGI%2037(2),%20562
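
The three normalization steps above can be sketched as follows. The set of characters that may be unescaped is an assumption, chosen so that the five example forms converge on the listed canonical form. The draft's own character classes should be consulted for a real implementation. The function name is hypothetical.

```python
import re
import string

# Assumption: characters safe to unescape. Letters, digits, and the marks
# below are picked so the example forms in the text normalize identically.
_UNESCAPABLE = set(string.ascii_letters + string.digits + "-_.!~*'()/,")

_PERCENT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_doi_uri(uri: str) -> str:
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower() != "doi":
        raise ValueError(f"not a doi URI: {uri!r}")

    def unescape(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Step 2: unescape only characters in the assumed safe set;
        # everything else (e.g. %20, %C3) stays percent-escaped.
        return char if char in _UNESCAPABLE else match.group(0)

    rest = _PERCENT_ESCAPE.sub(unescape, rest)
    # Step 1: lowercase scheme token. Step 3: uppercase the scheme-specific
    # part, which also uppercases the hex digits of remaining escapes.
    return "doi:" + rest.upper()


forms = [
    "DOI:dk/P%C3%A6dagogi%2037(2),%20562",
    "doi:DK/P%C3%A6dagogi%2037(2),%20562",
    "doi:dk/P%c3%a6dagogi%2037(2),%20562",
    "doi:dk/p%c3%a6dagogi%2037(2),%20562",
    "doi:dk%2FP%C3%A6dagogi%2037%282%29%2C%20562",
]
normalized = {normalize_doi_uri(form) for form in forms}
```

After normalization, comparing two "doi" URIs reduces to a plain string comparison, which is the stated purpose of the canonical form.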


  • (Ravichandran and Hovy, 2002) ⇒ Deepak Ravichandran, and Eduard Hovy. (2002). “Learning Surface Text Patterns for a Question Answering System.” In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL 2002).
    • QUOTE: Canonicalization of words is also an issue. While giving examples in the bootstrapping procedure, say, for BIRTHDATE questions, the answer term could be written in many ways (for example, Gandhi’s birth date can be written as “1869”, “Oct. 2, 1869”, “2nd October 1869”, “October 2 1869”, and so on). Instead of enlisting all the possibilities a date tagger could be used to cluster all the variations and tag them with the same term.

      The same idea could also be extended for smoothing out the variations in the question term for names of persons (Gandhi could be written as “Mahatma Gandhi”, “Mohandas Karamchand Gandhi”, etc.).
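
A date tagger of the kind suggested in the quote could, in its simplest form, map each surface variant to one normalized tag. The sketch below is a hypothetical illustration, not the authors' tagger. It covers only the date formats listed in the quote and assumes English month names under Python's default locale. A bare year is returned unchanged, since it carries less information than a full date.

```python
import re
from datetime import datetime

# Assumed surface formats, based on the variants listed in the quote.
_FORMATS = ("%b. %d, %Y", "%d %B %Y", "%B %d %Y", "%B %d, %Y")
_ORDINAL_SUFFIX = re.compile(r"(\d+)(st|nd|rd|th)\b")


def canonical_date(text: str):
    """Return an ISO date tag, a bare year, or None if unrecognized."""
    cleaned = _ORDINAL_SUFFIX.sub(r"\1", text.strip())  # "2nd" -> "2"
    if re.fullmatch(r"\d{4}", cleaned):
        return cleaned  # year only: keep as a coarser tag
    for fmt in _FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None


variants = ["Oct. 2, 1869", "2nd October 1869", "October 2 1869"]
tags = {canonical_date(v) for v in variants}
year_tag = canonical_date("1869")
```

All three full-date variants collapse to a single tag, so downstream pattern learning sees one answer term instead of several.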


  • (Lynch, 1999) ⇒ Clifford Lynch. (1999). “Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information.” In: D-Lib Magazine, Volume 5 Number 9.
    • QUOTE: Assume that we can define a canonical form for a class of digital objects that, to some extent, captures the essential characteristics of that type of object in a highly determined fashion. This form may be quite bulky and not necessarily reasonable for storing, transmitting, or manipulating objects. It’s an idealized form of the object, without regard to efficiencies. In addition, specific representations of the object may be richer than the canonical form. There may be a hierarchy of canonical forms, some of which are capable of representing much more detail or richer semantics than others (for example, ASCII text, Rich Text Format, and Word 98 format might be one such hierarchy).