1996 X-Tree

From GM-RKB
Jump to navigation Jump to search

Subject Headings: High-Dimensional Data, X-Tree Index Data Structure.

Notes

Cited By

Quotes

Abstract

In this paper, we propose a new method for indexing large amounts of point and spatial data in high-dimensional space. An analysis shows that index structures such as the R*-tree are not adequate for indexing high-dimensional data sets. The major problem of R-tree-based index structures is the overlap of the bounding boxes in the directory, which increases with growing dimension. To avoid this problem, we introduce a new organization of the directory which uses a split algorithm minimizing overlap and additionally utilizes the concept of supernodes. The basic idea of overlap-minimizing split and supernodes is to keep the directory as hierarchical as possible, and at the same time to avoid splits in the directory that would result in high overlap. Our experiments show that for high-dimensional data, the X-tree outperforms the well-known R*-tree and the TV-tree by up to two orders of magnitude

1. Introduction

In many applications, indexing of high-dimensional data has become increasingly important. In multimedia databases, for example, the multimedia objects are usually mapped to feature vectors in some high-dimensional space and queries are processed against a database of those feature vectors [Fal 94]. Similar approaches are taken in many other areas including CAD [MG 93], molecular biology (for the docking of molecules) [SBK 92], string matching and sequence alignment [AGMM 90], etc. Examples of feature vectors are color histograms [SH 94], shape descriptors [Jag 91, MG 95], Fourier vectors [WW 80], text descriptors [Kuk 92], etc. In some applications, the mapping process does not yield point objects, but extended spatial objects in high-dimensional space [MN 95]. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions. For querying these databases, it is essential to use appropriate indexing techniques which provide an efficient access to high-dimensional data. The goal of this paper is to demonstrate the limits of currently available index structures, and present a new index structure which considerably improves the performance in indexing high-dimensional data.

Our approach is motivated by an examination of R-tree-based index structures. One major reason for using R-tree-based index structures is that we have to index not only point data but also extended spatial data, and R-tree-based index structures are well suited for both types of data. In contrast to most other index structures (such as kdB-trees [Rob 81], grid files [NHS 84], and their variants [see e.g. SK 90]), R-tree-based index structures do not need point transformations to store spatial data and therefore provide a better spatial clustering.

Some previous work on indexing high-dimensional data has been done, mainly focussing on two different approaches. The first approach is based on the observation that real data in high-dimensional space are highly correlated and clustered, and therefore the data occupy only some subspace of the high-dimensional space. Algorithms such as Fastmap [FL 95], multidimensional scaling [KW 78], principal component analysis [DE 82], and factor analysis [Har 67] take advantage of this fact and transform data objects into some lower dimensional space which can be efficiently indexed using traditional multidimensional index structures. A similar approach is proposed in the SS-tree [WJ 96] which is an R-tree-based index structure. The SS-tree uses ellipsoid bounding regions in a lower dimensional space applying a different transformation in each of the directory nodes. The second approach is based on the observation that in most high-dimensional data sets, a small number of the dimensions bears most of the information. The TV-tree [LJF 94], for example, organizes the directory in a way that only the information needed to distinguish between data objects is stored in the directory. This leads to a higher fanout and a smaller directory, resulting in a better query performance.

We implemented the X-tree index structure and performed a detailed performance evaluation using very large amounts (up to 100 MBytes) of randomly generated as well as real data (point data and extended spatial data). Our experiments show that on high-dimensional data, the X-tree outperforms the TV-tree and the R*-tree by orders of magnitude (cf. section 4). For dimensionality larger than 2, the X-tree is up to 450 times faster than the R*-tree and between 4 and 12 times faster than the TV-tree. The X-tree also provides much faster insertion times (about 8 times faster than the R*-tree and about 30 times faster than the TV-tree). … 1. According to [BKSS 90], the R*-tree provides a consistently better performance than the R-tree [Gut 84] and R+-tree [SRF 87] over a wide range of data sets and query types. In the rest of this paper, we therefore restrict ourselves to the R*-tree.


,

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
1996 X-TreeHans-Peter Kriegel
Stefan Berchtold
Daniel A. Keim
The X-tree: An Index Structure for High-Dimensional Datahttp://infovis.uni-konstanz.de/papers/1996/VLDB96.pdf