2007 CanonicalizationOfDatabaseRecords

(Culotta et al., 2007b) ⇒ Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, and Andrew McCallum. (2007). “Canonicalization of Database Records using Adaptive Similarity Measures.” In: Proceedings of of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2007) [doi:10.1145/1281192.1281217].

Subject Headings: Record Canonicalization Algorithm.

Notes

Cited By

Quotes

Abstract

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from online papers. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is little existing work on canonicalization.

In this paper, we explore the use of edit distance measures to construct a canonical representation that is “central” in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different styles of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. KDD versus Conference on Knowledge Discovery and Data Mining). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. These approaches can incorporate arbitrary textual evidence to select a canonical record. We evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.

Introduction

Consider a research publication database such as Citeseer or Rexa that contains records gathered from a variety of sources using automated extraction techniques. Because the data comes from multiple sources, it is inevitable that an attribute such as a conference name will be referenced in multiple ways. Since the data is also the result of extraction, it may also contain errors. In the presence of this noise and variability, the system must generate a single, canonical record to display to the user.

Record canonicalization is the problem of constructing one standard record representation from a set of duplicate records. In many databases, canonicalization is enforced with a set of rules that place limitations or guidelines for data entry. However, obeying these constraints is often tedious and error-prone. Additionally, such rules are not applicable when the database contains records extracted automatically from unstructured sources.

Simple solutions to the canonicalization problem are often insufficient. For example, one can simply return the most common string for each field value. However, incomplete records are often more common than complete records. For instance, this approach may canonicalize a record as “J. Smith” when in fact the full name (John Smith) is much more desirable.

In addition to being robust to noise, the system must also be able to adapt to user preferences. For example, some users may prefer abbreviated forms (e.g., KDD) instead of expanded forms (e.g., Conference on Knowledge Discovery and Data Mining). The system must be able detect and react to such preferences.

In this paper, we first formalize the canonicalization problem and then propose three solutions. The first uses string edit distance to determine which record is most central in a set of records. This approach can mitigate the noise contained in outlying records. To enable the system to adapt to user preferences, the second solution optimizes the edit distance parameters using human-labeled data. Finally, we describe a feature-based solution that can flexibly incorporate textual evidence to select a canonical record. We again estimate the parameters of this method from labeled data. We show that this problem can be more naturally formulated as a ranking task, rather than a classification task, and modify the learning methods accordingly.

We perform several empirical comparisons of these approaches on publication records culled from the Web. The results indicate that the feature-based approach significantly outperforms competing approaches as well as a number of simpler baselines. Furthermore, we show that the parameters of the feature-based approach can be estimated from just a few training examples, and is quite robust to noise that is common in automatically generated databases.

2. Related Work

While there has been little work explicitly addressing canonicalization, the idea has been present in many application and research areas. In this section we review several of these applications as well as related work in the area of learning string edit distance costs.

Tejada et al. [13] devise a system to automatically extract and consolidate information from multiple sources into a unified database. When a user queries this database, multiple representations of an attribute are inevitable due to naming inconsistencies across the various sources from which they were drawn. Although object deduplication is the primary goal of the that research, canonicalization arises when the system presents results to the user. Tejada et al. propose ranking the strings for each attribute based on the user’s confidence in the data source from which the string was extracted. One difficulty in this approach is that if the data is extracted from a large number of sources, a non-trivial burden is placed on the users, who may not have the expertise to express preferences about each data source. Additionally, the database must store source-specific meta information for each string. Our canonicalization methods are adaptable to any database, regardless of whether the source of the information is available. Additionally, we enable the user to express preferences independent of the data source.

Other work has focused on learning the parameters of string edit distance with encouraging results. Zhu and Unger [15] apply string edit distances to the task of merging database records. They observe that parameters cannot be optimized individually due to the complex interaction between various edit cost weights on the outcome. Additionally they note that greedy methods are too likely to converge prematurely in local optima and that random restarts are unnecessarily expensive. Instead they propose a genetic algorithm to learn the weights of each cost and find that it stabilizes after 100 generations. In lieu of genetic approaches we propose learning the edit costs using either stochastic search, or an exhaustive search over a relatively small discrete space of possible parameter settings.

McCallum et al. [9] also use a discriminatively learned edit distance to perform record deduplication. They extend Zhu and Unger’s work by using a conditional random field to learn the costs of a variety of flexible edit distance operations. However, they do not explore canonicalization. Ristad and Yianilos [12] learn a probability distribution over atomic string edit operations (insertion, deletion, substitution) and define a stochastic transducer that defines the probability of a string as either the Viterbi sequence of edit operations or the sum of all possible sequences of edit operations required to produce that string. The parameters of this generative model are learned using the expectation maximization (EM) algorithm.

Bilenko and Mooney [1] present a method to learn edit distance based similarity measures of each attribute between records in order to perform deduplication. They extend the work of Ristad and Yianilos [12] by accommodating affine gaps. In similar fashion, the weights are learned iteratively with EM.

Our learning methods differ from those outlined in Ristad and Yianilos [12] and Bilenko and Mooney [1] in that we are not concerned with learning a generative model. We propose two methods to learn edit distance parameter settings using stochastic or exhaustive search. Additionally, we propose two feature-based methods that combine the outputs of multiple parameter settings (i.e., multiple edit distance models) and other sources of textual evidence into a discriminative model of canonicalization.

Canonicalization has also been implicitly considered in deduplication research. For example, Milch et al. [11] and McCallum and Wellner [10] propose deduplication models containing variables for canonical attributes of a record. The variables are used to help deduplication, although the accuracy of the resulting canonical records is not optimized or evaluated.

Recently, frameworks have been proposed to handle uncertainty in databases, particularly those containing automatically extracted records. One approach when consolidating extractions from various sources is to perform some type of weighted voting to determine which facts should be inserted in the database [8]. Another approach is to store the uncertainty of these extractions directly in the database. This can be accomplished by storing the top [math]\displaystyle{ n }[/math] most confident extraction results (along with corresponding probabilities or confidence measures) for each desired record. Gupta and Sarawagi [5] leverage confidence value outputs from the extraction models to improve query results on databases containing uncertainty. Fundamentally the problem is canonicalization because the system is faced with a choice when presenting multiple query results with various confidence values to the user. It is analogous to our canonicalization task except that we do not have the luxury of confidence values. While the inclusion of such values is clearly beneficial, we propose methods that achieve canonicalization in absence of such information (and often this information is strictly unavailable).

3. Problem Definition

Let a record [math]\displaystyle{ R }[/math] be a set of fields, R = {F1 . . . Fp}. Let field Fi be an attribute-value pair <ha, vi>. Table 1 shows three example records.

Databases constructed from multiple sources often accumulate multiple versions of the same record. Record deduplication is the task of detecting these different versions.

Table 1 shows three records that have been predicted to be duplicates. In fact, record (c) refers to a book chapter version, whereas (a) and (b) refer to conference proceedings.

Record deduplication is a difficult problem that has been well-studied. However, in this paper we are interested in a subsequent step: how to present the user one canonical representation of a record.

We define the canonicalization problem as follows: Given a set of duplicate records R = {R1 . . .Rk}, create a canonical record R* that summarizes the information in R. We refer to the canonicalization operation as C(R).

Note that it is not always clear what the optimal canonicalized record should be. Indeed, different users may prefer different forms of canonicalization. For example, some users may prefer the abbreviated conference string IJCAI, while others may prefer the expanded string International Joint Conference on Artificial Intelligence. However, there are a few desiderata of a good canonicalization:

Error-free: The canonical record should not contain errors, such as misspellings or incorrect field values. This is especially a concern when the data has been automatically extracted from noisy sources (e.g. when text must be extracted from a PDF file and field assignments are automated). In these cases, there may exist outlying records that contain erroneous data. The canonicalizer should attempt to minimize the effect of these errors.

Complete: The canonical record should contain all the accurate information contained in the duplicate records. Thus, even if not all records contain a date field, the field should be included in the canonical record.

Representative: The canonical record should reflect the commonality among the duplicate records. Thus, the canonical record should in some sense be similar to all of the duplicate records.

…

References

,

	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2007 CanonicalizationOfDatabaseRecords	Aron Culotta Michael Wick Robert Hall Matthew Marzilli Andrew McCallum			Canonicalization of Database Records using Adaptive Similarity Measures		Proceedings of of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining	http://www.cs.umass.edu/~mwick/MikeWeb/Publications files/culotta07canonicalization.pdf	10.1145/1281192.1281217		2007