2004 InstanceBasedSchemaMatchingForWebDBs

From GM-RKB

Subject Headings: Schema Matching Algorithm.

Notes

Cited By

Quotes

Abstract

  • In a Web database that dynamically provides information in response to user queries, two distinct schemas, interface schema (the schema users can query) and result schema (the schema users can browse), are presented to users. Each partially reflects the actual schema of the Web database. Most previous work only studied the problem of schema matching across query interfaces of Web databases. In this paper, we propose a novel schema model that distinguishes the interface and the result schema of a Web database in a specific domain. In this model, we address two significant Web database schema-matching problems: intra-site and inter-site. The first problem is crucial in automatically extracting data from Web databases, while the second problem plays a significant role in meta-retrieving and integrating data from different Web databases. We also investigate a unified solution to the two problems based on query probing and instance-based schema matching techniques. Using the model, a cross validation technique is also proposed to improve the accuracy of the schema matching. Our experiments on real Web databases demonstrate that the two problems can be solved simultaneously with high precision and recall.

6. Related Work

  • Schema matching is a basic problem in database research with numerous techniques proposed to address the problem (see [11] and [22] for surveys). Existing work that addresses the problem of automatic schema matching for Web databases adopts the prior techniques on matching schemas of traditional databases. [16] presented a statistical approach to integrate the interface schemas of Web databases in the same domain. It hypothesizes that given Web databases in the same domain, the aggregate vocabulary describing the interface input elements tends to have a relatively small size. Furthermore, there exists a unified hidden schema underlying these interfaces. A statistical probability model is employed to find the hidden schema by the co-appearance of attribute names. The schema matching methods employed are label-based.
  • [17] introduced a tool, WISE-Integrator, that performs automatic integration of Web search interfaces in a product domain. WISE-Integrator employs comprehensive meta-data, such as element labels and default values of the elements, to automatically identify matching attributes from different search interfaces.
  • [19] investigated algorithms for generic schema matching, outside of any particular data model or application. An algorithm called Cupid was proposed to discover mappings between schema elements based on their names, data types, constraints, and schema structure.
  • [18] used a classifier to categorize attributes according to their field specifications and data values, and then trained a neural network to recognize similar attributes. However, this method may not be applicable to Web databases since both field specifications and data values are incomplete in many cases.
  • [11] developed the COMA schema-matching system as a platform to combine multiple matchers in a flexible way. While their approach may seem similar to our cross validation method, it is fundamentally different since the goal of our method is the reinforcement of multiple matchers, not the straightforward combination of the matchers.
  • [21] presented HiWe, a prototype deep-web crawler that can extract the labels of interface elements and automatically submit queries through the elements. Interface elements with the same/similar labels are matched in order to obtain each other’s domain values for automatic query submission.
  • The main difference between our work and previous work is that we aim to provide a general framework for schema matching of Web databases. To the best of our knowledge, no previous work has presented such a framework, especially the combined schema model. Moreover, the instance-based schema-matching method is seldom used for schema matching in the Web database context since it is hard to get instances from Web databases. Supplied with a set of sample instances, our work proves that instance-based methods can also be very effective for Web database schema matching.
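
To make the idea of instance-based matching concrete, here is a minimal sketch that compares attributes by the content of their sample instances rather than by their labels. It is an illustration only, not the paper's algorithm: the function names (`token_vector`, `cosine`, `match_attributes`), the whitespace tokenization, and the sample data are all assumptions made for this example.

```python
import math
from collections import Counter


def token_vector(values):
    """Build a bag-of-tokens count vector from an attribute's sample instances."""
    counts = Counter()
    for value in values:
        counts.update(value.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_attributes(schema_a, schema_b):
    """For each attribute of schema_a, return the most similar attribute of schema_b.

    Both schemas map an attribute label to a list of sample instance strings.
    The result maps each label in schema_a to a (best_label_in_b, score) pair.
    """
    vectors_b = {name: token_vector(vals) for name, vals in schema_b.items()}
    result = {}
    for name_a, vals_a in schema_a.items():
        vec_a = token_vector(vals_a)
        best = max(vectors_b, key=lambda n: cosine(vec_a, vectors_b[n]))
        result[name_a] = (best, cosine(vec_a, vectors_b[best]))
    return result


# Hypothetical sample instances from two book sites with different labels.
site_a = {
    "Title": ["data mining concepts", "database systems"],
    "Author": ["jiawei han", "hector garcia-molina"],
}
site_b = {
    "Book Name": ["database systems", "mining of massive datasets"],
    "Writer": ["jiawei han", "jeffrey ullman"],
}
matches = match_attributes(site_a, site_b)
```

Here the labels "Title" and "Book Name" share no words, so a purely label-based matcher would have nothing to compare, while the overlapping instance values still link the two attributes.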

7. Conclusion

matching for Web databases. We propose a combined schema model to describe the various schemas associated with a Web database and a generative view to include five kinds of schema matching of related Web databases in a specific domain.

  • In the combined schema model, we address two significant schema-matching problems for Web databases, intra-site schema matching and inter-site schema matching. We then investigate a unified solution to the two problems based on domain-specific query probing and attribute content overlap. Our instance-based approaches, which adopt the mutual information concept and vector similarity analysis, are quite powerful for precisely identifying the matching relationships among attributes of Web databases’ interface and result schemas. Benefiting from our general framework, a cross validation technique, converted to a graph-partitioning problem, is introduced and shown to improve the matching performance.
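
The mutual information idea mentioned above can be illustrated with a small sketch. It assumes that query probing has already produced an occurrence matrix in which entry (i, j) counts how often a value submitted through interface element i reappeared in result column j. The matrix layout, the function names, and the sample counts are assumptions for illustration; the paper's actual estimation procedure is not reproduced here.

```python
import math


def mutual_information_scores(occurrences):
    """Per-cell mutual information contribution p(i,j) * log2(p(i,j) / (p(i) * p(j))).

    occurrences[i][j] is a non-negative count; cells with a zero count score 0.0.
    """
    total = sum(sum(row) for row in occurrences)
    row_sums = [sum(row) for row in occurrences]
    col_sums = [sum(col) for col in zip(*occurrences)]
    scores = []
    for i, row in enumerate(occurrences):
        score_row = []
        for j, count in enumerate(row):
            if count == 0:
                score_row.append(0.0)
                continue
            p_ij = count / total
            p_i = row_sums[i] / total
            p_j = col_sums[j] / total
            score_row.append(p_ij * math.log2(p_ij / (p_i * p_j)))
        scores.append(score_row)
    return scores


def best_matches(occurrences):
    """For each interface element (row), pick the result column with the highest score."""
    scores = mutual_information_scores(occurrences)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Hypothetical counts: 3 interface elements probed against 3 result columns.
probe_counts = [
    [18, 2, 0],
    [1, 15, 1],
    [0, 3, 20],
]
intra_site_mapping = best_matches(probe_counts)
```

A cell scores high only when its count is large relative to both its row total and its column total, so a result column that echoes every query value does not dominate the matching the way it would under raw counts.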

involvement to provide a precise global schema and instance samples. One direction to extend this work is to adopt automatic global schema generation techniques to make the whole system fully automatic. Another direction of improvement is to combine our work with previous label-based approaches to build a more robust matching system. In addition, we plan to extend this work to handle not only 1:1 mappings but also 1:N mappings over Web database schema attributes.


2004 InstanceBasedSchemaMatchingForWebDBs

  • Author: Ji-Rong Wen, Wei-Ying Ma, Jiying Wang, Fred Lochovsky
  • Title: Instance-based Schema Matching for Web Databases by Domain-specific Query Probing
  • Title URL: http://portal.acm.org/citation.cfm?id=1316726