2013 Similarity Joins in Relational Database Systems

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Abstract

State-of-the-art database systems manage and process a variety of complex objects, including strings and trees. For such objects equality comparisons are often not meaningful and must be replaced by similarity comparisons. This book describes the concepts and techniques to incorporate similarity into database systems. We start out by discussing the properties of strings and trees, and identify the edit distance as the de facto standard for comparing complex objects. Since the edit distance is computationally expensive, token-based distances have been introduced to speed up edit distance computations. The basic idea is to decompose complex objects into sets of tokens that can be compared efficiently. Token-based distances are used to compute an approximation of the edit distance and prune expensive edit distance calculations. A key observation when computing similarity joins is that many of the object pairs, for which the similarity is computed, are very different from each other. Filters exploit this property to improve the performance of similarity joins. A filter preprocesses the input data sets and produces a set of candidate pairs. The distance function is evaluated on the candidate pairs only. We describe the essential query processing techniques for filters based on lower and upper bounds. For token equality joins we describe prefix, size, positional and partitioning filters, which can be used to avoid the computation of small intersections that are not needed since the similarity would be too low.

Preface

During the last few decades database systems have evolved substantially and today it is common for database systems to manage a large variety of complex objects. At the physical level, there has been significant progress in terms of storing and processing such complex objects. At the logical level, however, complex objects remain a challenge. A key reason is that equality, which is appropriate for simple objects, is often ineffective for complex objects. In this book we describe the essential concepts and techniques toward a principled solution for processing complex objects that must be compared in terms of similarity rather than equality.

An intuitive approach to define the similarity of complex objects is the edit distance, i.e., the number of basic edit operations that are required to transform one object into another. The intuitive nature of the edit distance is the reason why edit distances have become the de facto standard for complex objects. We define the string and tree edit distance, give algorithms to compute these distances, and work out the essential properties that support the effective and efficient processing of complex objects.
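As a concrete illustration, the following Python sketch computes the string edit distance with the classic dynamic program. It is our own minimal example, not code from the book, and the function name edit_distance is our choice.

```python
def edit_distance(s: str, t: str) -> int:
    """String edit distance: the minimum number of character
    insertions, deletions, and replacements that turn s into t."""
    m, n = len(s), len(t)
    # prev[j] = distance between a prefix of s and the first j characters of t.
    prev = list(range(n + 1))  # transforming "" into t[:j] takes j insertions
    for i in range(1, m + 1):
        curr = [i] + [0] * n   # transforming s[:i] into "" takes i deletions
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s[i-1]
                          curr[j - 1] + 1,     # insert t[j-1]
                          prev[j - 1] + cost)  # replace s[i-1] by t[j-1], or match
        prev = curr
    return prev[n]

print(edit_distance("tree", "trees"))  # 1: a single insertion
```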

Token distances, which decompose complex objects into sets of tokens, have been proposed to deal with the high computational cost of edit distances. The token distance is computed by comparing the token sets that result from the decomposition. The more similar two token sets are, the smaller their distance is. Token sets are compared by counting the number of identical elements in the sets. Set intersection is an operation that is well supported by database systems and scales to large sets. We survey the different techniques to compute and process token sets, and we discuss in detail three representative decomposition techniques: strings with q-grams, ordered trees with pq-grams, and unordered trees with windowed pq-grams.
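As a small illustration of the token-based approach for strings, the following Python sketch decomposes strings into q-grams and compares the resulting token bags. It is our own example, not code from the book; the padding characters and function names are our choice.

```python
from collections import Counter

def qgram_profile(s: str, q: int = 2) -> Counter:
    """Bag (multiset) of q-grams of s. The string is padded with q-1
    '#' characters on the left and q-1 '$' characters on the right, a
    common convention, so that every character occurs in q tokens."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def qgram_distance(s: str, t: str, q: int = 2) -> int:
    """Number of q-grams, counted with multiplicity, that appear in
    one profile but not in the other; small for similar strings."""
    ps, pt = qgram_profile(s, q), qgram_profile(t, q)
    return sum((ps - pt).values()) + sum((pt - ps).values())

# The 2-gram profiles of "police" and "olive" share the tokens ol, li, e$.
print(qgram_distance("police", "olive"))  # 7
```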

Determining the exact distance between complex objects, particularly for joins where all pairs of objects must be compared, is often too expensive. To reduce the cost of such computations, filter-and-refine approaches have been developed. The goal of the filter step is to cheaply identify candidate pairs for which the exact similarity must be computed. Non-candidate pairs need not be considered because their similarity is too low for them to be included in the result. We describe various filter techniques, and provide lower and upper bounds that can be used to efficiently compute similarity joins.
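The following sketch illustrates the filter-and-refine pattern for an edit distance join with threshold tau, using the length difference of two strings as a cheap lower bound on their edit distance. It is our own minimal example and reuses the edit_distance function from the sketch above.

```python
def edit_distance_join(rs, ss, tau):
    """All pairs (r, s) from rs x ss with edit_distance(r, s) <= tau,
    computed with a filter-and-refine approach."""
    result = []
    for r in rs:
        for s in ss:
            # Filter: each edit operation changes the string length by at
            # most one, so |len(r) - len(s)| is a lower bound on the edit
            # distance; pairs that fail this cheap test cannot qualify.
            if abs(len(r) - len(s)) > tau:
                continue
            # Refine: evaluate the expensive distance on candidates only.
            # edit_distance: see the earlier sketch.
            if edit_distance(r, s) <= tau:
                result.append((r, s))
    return result

print(edit_distance_join(["police", "apple"], ["olive", "pol"], 2))
# [('police', 'olive')] -- ('police', 'pol') is pruned by the length filter
```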

The book uses strings and trees as representative examples of complex objects. The techniques discussed in this book, however, are general and are also applicable to other types of objects. In particular, graphs are an important data structure that has recently received a lot of attention, and for which edit- and token-based distances have been proposed. At the relevant places we provide references to the vibrant and emerging field of graphs in databases.

The book is intended as a starting point for researchers and students in the database field who would like to learn more about similarity in database systems. Throughout, we offer precisely defined concepts and properties, and we illustrate these with representative and carefully chosen examples. We made an effort to include precise definitions, theorems, and examples, but at the same time kept the description at a level that is understandable to a general audience with an academic background. Much of the material presented in this book has been used in courses taught during the last few years at the Free University of Bozen-Bolzano, Technische Universität München, University of Salzburg, and University of Zürich. Our warm thanks go to the students who provided constructive feedback and helped to advance the material presented in this book.

Introduction

Many applications rely on database systems for storing and querying data. Most often these are extensions of relational database systems, which build on strong theoretical foundations and mature technology that has been developed and advanced for decades. Current database systems focus on exact queries to, e.g., look up customers by their social security number, compute average sales per department, or join tables on keys and foreign keys. Unfortunately, for complex data types like strings or trees, such queries often lead to poor results since two data items may be different even if they represent the same real-world object.

In the case of strings, differences may arise due to typos, varying coding conventions used by different users or companies (e.g., Transactions on Database Systems versus Trans. on Database Systems), errors introduced by OCR software, poor-quality entries from information retrieval software, etc.

In the case of trees, which store the hierarchical relationships between individual data items, content as well as structure may differ. A popular example is XML, which organizes data items in a tree structure. Data sources that store information about the same object may represent it differently. For example, both DBLP and SIGMOD Record store bibliographic data about scientific articles in XML, but they use different structures to represent them. Even if the structure of two XML sources is the same, one of the sources may store more information than the other, for example, in optional elements and attributes. Although there is no exact match between the XML fragments that represent the same scientific article, their XML trees are likely to be more similar to each other than the XML trees of two different articles are.

1.1 APPLICATIONS OF SIMILARITY QUERIES

Joining XML about Music Albums

As an application example, consider an online database about music CDs that integrates data from two XML sources: a song lyric store and a CD warehouse. The integrated database stores the artists and songs of each album, information about individual songs such as lyrics and guitar tabs, and information about the artists.

Figure 1.1 shows tree representations of two different XML documents. Both represent data about the same song album. Yet exact, ordered tree matching would not consider the items the same, for a number of reasons. The song lyric store has a year element that is absent from the CD warehouse. The CD warehouse has a price for the album. For one track the two databases list different artists. Also, the document order of the elements differs, i.e., the two documents have different sibling orders.
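Since Figure 1.1 is not reproduced here, the following Python sketch uses hypothetical stand-in fragments (all element names and values are made up for illustration) to show how exact, ordered comparison fails even when both documents describe the same album.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for Figure 1.1: the lyric store carries a year,
# the warehouse carries a price, and the sibling order of tracks differs.
lyric_store = ET.fromstring(
    "<album><year>2013</year><track>Song A</track><track>Song B</track></album>")
warehouse = ET.fromstring(
    "<album><price>9.99</price><track>Song B</track><track>Song A</track></album>")

# Exact, ordered matching treats the two documents as entirely different.
print(ET.tostring(lyric_store) == ET.tostring(warehouse))  # False
```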

1.2 EDIT-BASED SIMILARITY MEASURES

In order to answer similarity queries, the similarity between data items must be computed. Various similarity measures for strings and trees have been proposed. Edit-based similarities form an interesting class of such measures.

Edit-based similarity measures express the difference between two objects by the number of basic edit operations that are required to transform one object into the other. The smaller the number of required edit operations, the more similar the objects are. For string data, for example, the canonical set of edit operations is a) deleting a character, b) inserting a character, and c) replacing a character. With this set of edit operations, two edit operations are required to transform police into olive (deleting p and replacing c by v). Since the similarity decreases with the number of edit operations, the alikeness of two objects is often expressed as a dissimilarity rather than a similarity, i.e., as a distance. In our example the distance between police and olive is two.
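The two operations from the example can be traced step by step; the following toy snippet is our own illustration, not from the book.

```python
word = "police"
word = word[1:]                # delete 'p'         -> "olice"
word = word.replace("c", "v")  # replace 'c' by 'v' -> "olive"
assert word == "olive"         # two edit operations: distance 2
```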

…

Data Types

Edit-Based Distances

Token-Based Distances

Query Processing Techniques

Filters for Token Equality Joins

Conclusion

References


Author: Nikolaus Augsten, Michael H. Böhlen
Title: Similarity Joins in Relational Database Systems
DOI: 10.2200/S00544ED1V01Y201310DTM038
Year: 2013