The Index For The Nearest Neighbor Queries In High Dimensional Space

Posted on:2008-10-19

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J Q Zhang

Full Text:PDF

GTID:1118360215984462

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Many emerging database applications such as image, time series and scientific databases, manipulate high dimensional data. In these applications, one of the most frequently used and yet expensive operations is to find objects in the high-dimensional database that are similar to a given query object. Nearest neighbor search is a central requirement in such cases.There is a long stream of research on solving the nearest neighbor search problem, and a large number of multidimensional indexes have been developed for this purpose. However these indexes turn worse with the dimension growth, which is called dimensionality curse.In order to improve the query efficiency, K-means cluster approach is often used to estimate the data distribution in the context of high dimensional metric space index. But in previous work, the parameters of clustering are usually selected according to some heuristic manner. In this paper, we present a new high dimensional index approach - cluster splitting based high dimensional B⁺-tree. Through cluster splitting, the data space is partitioned more finely to reduce the cost of data access. The relationship between cluster and the query cost is discussed, and based on the query cost model, we present formulas to compute the "optimal" parameters of the cluster which can minimize the query cost in theory. Our experiment results show that the efficiency of our methods is better than iDistance, M-Tree and sequence scan, and the parameters computed by our formulas are very closed to the real optimal one.In order to improve the query answering of high-dimensional database, data distribution is necessary for seclecting appropriate indexing strategy. However, the traditional data distribution model can not estimate the accurate data distribution in the complex real multimedia data of image and video. This paper presents a method to estimate the accurate data distribution based on query sampling and proposes a novel hybrid index to speed up processing of high-dimensional K-nearest neighbor(KNN) queries. The proposed hybrid index improves the query efficiency by adaptively selection different index strategies for the data with different distribution. In the first step the cluster analysis and cluster splitting methods are applied to construct a tree-based index, then the relationship between data distribution and index performance is derived by sampling. At last some tree branches with sparse data are extracted for linear scan, while the aggregate data remains in the tree. Extensive experiments on five real image data sets show that the proposed hybrid index structure performs better than iDistance, M-Tree and linear scan, and scales better with dimensions. The index is still faster than linear scan when the dimension reaches 336. The experiments also show that the proposed query sampling algorithm can obtain the accurate data distribution when the amount of sampling is below N^1/2 (N is the size of data set).In general,index can not support the relevance feedback in the image retrieval process.In this dissertation,a novel algorithm supporting relevance feedback using the proposed high-dimensional index.

Keywords/Search Tags:

Nearest neighbor query, high dimensional index, sampling, cluster splitting

PDF Full Text Request

Related items

1	Research On The Visual Group K-Nearest Neighbor And Group Inverse K-Nearest Neighbor Query Of Multi-Source Objects In Three-Dimensional Space
2	Research Of High-Dimensional Space Query Algorithm Based On Space-Filling Curves
3	Research Of Local Sensitive Hash Index Based On Nearest Neighbor Graph
4	Spatio-temporal Database Based On The Grid Index Reverse Nearest Neighbor Query Technology
5	Research On Spatial Data Index Based On Nearest Neighbor Distance And Query Algorithms
6	Locality Sensitive Hashing Index Based On Neighborhood Collision Counting
7	Study On K Nearest Neighbor Query Based On Road Networks In Mobile Database
8	Research On High-dimensional Indexing In Content-Based Image Retrieval
9	Research Of Reverse Nearest Neighbor Query In Spatial Database
10	Research Of High-Dimensional Space Join And Query Algorithms Based On Main-Memory