2008 AUnifiedApproachforSchemaMatchi

Jump to: navigation, search

Subject Headings: Schema Matching Algorithm, Entity Record Deduplication Algorithm, Record Canonicalization Algorithm, Joint Inference Model, Information Integration, Discriminative Model.


Cited By



Author Keywords

Data Integration, Coreference, Schema Matching, Canonicalization, Conditional Random Field, Weighted Logic


The automatic consolidation of database records from many heterogeneous sources into a single repository requires solving several information integration tasks. Although tasks such as coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems traditionally solve each independently, allowing errors to propagate from one task to another. In this paper, we describe a discriminatively-trained model that reasons about schema matching, coreference, and canonicalization jointly. We evaluate our model on a real-world data set of people and demonstrate that simultaneously solving these tasks reduces errors over a cascaded or isolated approach. Our experiments show that a joint model is able to improve substantially over systems that either solve each task in isolation or with the conventional cascade. We demonstrate nearly a 50% error reduction for coreference and a 40% error reduction for schema matching.

1. Introduction

As the amount of electronically available information continues to grow, automatic knowledge discovery is becoming increasingly important. Unfortunately, electronic information is typically spread across multiple heterogeneous resources (databases with different schemas, or web documents with different structures) making it necessary to consolidate the data into a single repository or representation before data mining can be successfully applied. However, data integration is a challenging problem. Even the task of merging two databases with similar schemas about the same realworld entities is non-trivial. An automatic system must be able to perform coreference (to identify duplicate records), canonicalization (to pick the best string representation of the duplicate record), and schema matching (to align the fields across schemas).

Coreference and other integration tasks have been studied almost exclusively in isolation, yet the individual problems are highly correlated. As an example, consider the two different data records of a person named John Smith in Table 1. Each data record is represented using a different schema. In this example, knowing that the Contact attribute from schema A maps to the Phone attribute from schema B provides strong evidence that the two mentions are coreferent, indicating that schema matching is a valuable precursor to coreference. However, knowing that the two John Smith mentions are coreferent provides strong evidence about which fields should be matched across the schemas (for example, the FirstName and LastName attributes of schema A should be mapped to the Name attribute of schema B). The high correlation of these two tasks indicate that a cascaded approach, where one task must be solved before the other, is likely to lead to gratuitous error propagation.

To motivate the idea further, consider the task of canonicalization, the process of creating a single standardized representation of a record from several different alternatives. The result of canonicalization on a set of records is a single record containing a high density of information about a real-world entity. Intuitively, these canonical representations of entities contain valuable evidence for coreference. We would like exploit this entity-level information, yet canonicalization assumes coreference has already been performed.

Table 1: Two records from different schemas representing the same John Smith |Schema A || Schema B |FirstName=John||Name=John Smith |MiddleName=R.||Phone=1-(222)-222-2222 |LastName=Smith|| |Contact=222-222-2222||

In this paper we investigate a unified approach to data integration that jointly models several tasks and their dependencies. More precisely, we propose a conditional random field for simultaneously solving coreference resolution, record canonicalization, and schema matching. As described in Section 3.1, one particular feature of our model is that it automatically discovers the top level canonical schema. We use first order logic clauses for parameter tying, effectively combining logic and probability in a manner similar to [24, 7, 19]. Exact inference and learning in these models are intractable, thus we present approximate solutions to both these problems. Our approximations prove to be effective allowing us to achieve almost a 50% reduction in error for coreference and a 40% error reduction in schema matching over non-joint baselines.

2. Related Work

2.1 Coreference Resolution

Coreference is a pervasive problem in data integration that has been studied in several different domains. The ACE and MUC corpora have helped initiate a line of research on newswire coreference, beginning with approaches that examine mention pairs [25, 18, 12] to more complicated models that reason over entire sets of mentions [7]. Person disambiguation, another form of coreference, has also been studied in detail in [21, 11, 10, 26, 14, 4, 1]. However, these works only resolve coreference between objects of the same representation (e.g., database schema). The coreference problem we tackle involves objects that contain different representations, making direct comparisons between these objects difficult (for example, we may not know in advance that Phone from schema A maps to Contact in schema B). The coreference portion of our model factorizes over sets of mentions and incorporates first order logic features making it most similar to Culotta et al. [7].

2.2 Canonicalization

Since records can refer to the same underlying entity in multiple ways (common aliases, acronyms and abbreviations), it is often necessary to choose a single and standardized representation when displaying the result to a user, or storing it compactly in a database. Additionally, because the canonical record is constructed from multiple records, it contains a high density of information about the entity, making it a convenient source of evidence for coreference resolution.

Canonicalization has played an important role in systems that perform coreference and database merging. Traditionally, it is performed post hoc and often relies on metrics for evaluating distances between strings. An example of canonicalization in a database merging task is Zhu and Unger [28], who obtain positive results by learning string edit parameters with a genetic algorithm. McCallum et al. [17] extend usual edit distance models with a conditional random field, demonstrating more accurate distance evaluations on several corpora; however, they do not apply their string distance model to the problem of canonicalization. Other approaches include Ristad and Yianilos [23], who use expectation maximization to learn the parameters of a generative model that defines a string in terms of the string edit operations required to create it. This work is extended and applied successfully to record deduplication by Bilenko and Mooney [2]. Recently, Culotta et al. [6] describe several methods for canonicalization of database records that are robust to noisy data and customizable to user preferences (e.g., a preference for acronyms versus full words).

2.3 Schema Matching

Schema and ontology mapping are fundamental problems in many database application domains such as data integration, E-business, data warehousing, and semantic query processing. In general we can identify two major challenges with the schema matching (and ontology mapping) problem: (1) structural heterogeneity and, (2) semantic heterogeneity. The former concerns the different representations of information where the same information can be represented in different ways. This is a common problem with heterogeneous databases. The latter deals with the intended meaning of the described information. More specifically we can identify the following differences between schemas [27]: (1) structural conflicts concerned with different semantic structures; (2) naming conflicts where different attribute names may be used for the same type of information, or the same name for slightly different types of information; (3) conflicts where different formats maybe used to represent the values of attributes (for example, different units, different precision, or different abbreviation styles). This problem has been extensively studied primarily by the database and machine learning communities [20, 13, 15, 8, 9, 27] (for a survey of the traditional approaches refer to [22]).

Our model is able to reason about all three different kinds of conflicts mentioned above, based on a set of first order logic features employed by the CRF modeling the task. Our approach also differs from previous systems in that schema matching is performed jointly along with coreference and canonicalization, resulting in a significant error reduction as we will see in Section 5. One important aspect of our model is that it will automatically discover the top level canonical schema for the integrated data as will be demonstrated in Section 3.1.

3. Problem Definition

We seek a general representation that allows joint reasoning over a set of tasks defined on a set of possibly heterogeneous objects in the world. In our unified data integration approach we aim for a representation that enables us to perform inference and learning over two different types of objects: (1) data records (in relation to the coreference resolution and canonicalization tasks); (2) schema attributes (in relation to the schema matching task). In abstract terms, our model finds solutions to this problem in terms of a set of partitions (clusters) of data records where all the records within a particular partition are coreferent and canonicalized; and also a set of partitions (clusters) of the schema attributes across different databases, where all the attributes within a schema partition are mapped together. As we will see shortly, although data record clusters and schema clusters are kept disjoint in terms of the type of objects they contain, they are tied together through a set of factors that compute different features associated with each task. Thus, inference is performed jointly over all three tasks.

4. Experiments

4.1 Data

Synthetic data

The synthetic data is generated from a small number (10) of user records collected manually from the web. These records use a canonical schema containing attributes such as first name, phone number, email address, job title, institution, etc. Next, we created three new schemas derived from the canonical schema by randomly splitting, merging, or noisifying the attributes of the canonical schema. For example, one schema would contain a Name field whereas another would contain two fields, FirstName and LastName, for the same block of information (perhaps dropping the middle name if it existed). In the training data we used the first two schemas and in the testing data we used one of the schemas from training, and also the third schema. This way we train a model on one schema but test it on another schema. For training we used a small number of user records that were similar in ways such that random permutations could make coreference, schema matching, and canonicalization decisions difficult. We first conformed the records to both schemas, and then made 25-30 copies of each data record for each schema while randomly permuting some of the fields to introduce noise. The testing data was created similarly, but for a different set of data records. The random permutations included abbreviating fields, deleting an entire field, or removing a random number of tokens from the front and/or the end of a field. The result of this was a large number of coreferent records, but with the possibility that the disambiguating fields between different records have been altered or removed.

Real World Data

For our real-world data we manually extracted faculty and alumni listings from 7 university websites, of which we selected 8 lists that had different schemas. The information in each schema contains basic user data such as first name, phone number, email address, job title, institution, etc., as well as fields unique to the academic domain such as advisor, thesis title, and alma mater. For each name that had an e-mail address we used our DEX [3] system to search the Internet for that person’s homepage, and if found we would extract another record. The DEX schema is similar to the university listings data, bringing the total number of different schemas to 9. Of the nearly 1400 mentions extracted we found 294 coreferent entities. Table 2 shows the DEX schema and two of the faculty listing schemas. There are several schema matching problems evident in table 2, for example Job Department from the UPenn schema is a superset of both of the Job Title fields. Another example, which occurs numerous times between other schemas, is where the pair of attributes "First Name, Last Name# from one schema is mapped to the singleton attribute Name that denotes the full name.

For each of the experiments we took all of the 294 coreferent clusters, and randomly selected 200 additional mentions that had no coreferent entities. This data was split into training and testing sets, where the only schema shared between training and testing was the DEX schema. This ensures that the possible schema matchings in the training set are disjoint from the possible schema matchings in the test set. The data provides us with a variety of schemas that map to each other in different ways, thus making an appropriate testbed for performing joint schema matching, coreference, and canonicalization.

4.2 Features

The features are first-order logic clauses that aggregate pairwise feature extractions. The types of aggregation depend on whether the extractor is real-valued or boolean. For real-valued features we compute the minimum, the maximum, and the average value over all pairwise combinations of records. For boolean-valued features we compute the following over all pairwise combinations of records: feature does not exist; feature exists; feature exists for the majority; feature exists for the minority; feature exists for all. Table 3 lists the set of features used in our experiments. In cases where we compute features between records in different schemas, sometimes we only compute the feature between fields that are aligned in the current schema matching. This is indicated by the Matched-fields only column in table 3. All of these features are used for coreference decisions, but only substring matching is used for schema matching.

4.3 Systems

We evaluate the following three systems with and without canonicalization for a total of six systems. Canonicalization is integrated with coreference inference by interleaving canonicalization with greedy agglomerative merges (every time a new cluster is created, a new canonical record is computed for it). Canonicalization has no affect on schema matching in isolation and is only relevant when coreference is involved.

5. Conclusions

In this paper we have demonstrated a successful method for performing coreference resolution, record canonicalization, and schema matching simultaneously with a conditional random field. Joint inference in this model outperformed cascaded and isolated approaches on both synthetic and real-world data despite having to use approximate inference and parameter estimation.

We believe a ripe area of future work would be to explore less greedy inference methods such as Metropolis-Hastings, simulated-annealing, or similar stochastic search methods. Additionally, rank-based learning may be combined with Metropolis-Hastings to train an even larger model that includes first-order logic clauses over entire clustering configurations. Finally, extending the model to include other tasks such as named entity recognition (NER) and record assembly is a natural next step. This could be particularly interesting because exact learning and inference for linear-chain CRFs (like those used in NER) is tractable, yet the other components of the model would have to be approximated.



 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2008 AUnifiedApproachforSchemaMatchiMichael Wick
Khashayar Rohanimanesh
Karl Schultz
Andrew McCallum
A Unified Approach for Schema Matching, Coreference and CanonicalizationKDD-2008 Proceedingshttp://ciir-publications.cs.umass.edu/pdf/IR-684.pdf10.1145/1401890.14019772008