Record Canonicalization Algorithm

From GM-RKB
Jump to navigation Jump to search

A Record Canonicalization Algorithm is an Algorithm that can solve a Record Canonicalization Task and be implemented into a Record Canonicalization System.



References

2009

2008

2007

  • (Culotta et al., 2007) ⇒ Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, and Andrew McCallum. (2007). “Canonicalization of Database Records using Adaptive Similarity Measures.” In: Proceedings of KDD-2007.
    • Consider a research publication database such as Citeseer or Rexa that contains records gathered from a variety of sources using automated extraction techniques. Because the data comes from multiple sources, it is inevitable that an attribute such as a conference name will be referenced in multiple ways. Since the data is also the result of extraction, it may also contain errors. In the presence of this noise and variability, the system must generate a single, canonical record to display to the user.
    • Record canonicalization is the problem of constructing one standard record representation from a set of duplicate records. In many databases, canonicalization is enforced with a set of rules that place limitations or guidelines for data entry. However, obeying these constraints is often tedious and error-prone. Additionally, such rules are not applicable when the database contains records extracted automatically from unstructured sources.
    • Simple solutions to the canonicalization problem are often insufficient. For example, one can simply return the most common string for each field value. However, incomplete records are often more common than complete records. For instance, this approach may canonicalize a record as “J. Smith” when in fact the full name (John Smith) is much more desirable.
    • In addition to being robust to noise, the system must also be able to adapt to user preferences. For example, some users may prefer abbreviated forms (e.g., KDD) instead of expanded forms (e.g., Conference on Knowledge Discovery and Data Mining). The system must be able detect and react to such preferences.