Font Size: a A A

The Index For The Nearest Neighbor Queries In High Dimensional Space

Posted on:2008-10-19Degree:DoctorType:Dissertation
Country:ChinaCandidate:J Q ZhangFull Text:PDF
GTID:1118360215984462Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Many emerging database applications such as image, time series and scientific databases, manipulate high dimensional data. In these applications, one of the most frequently used and yet expensive operations is to find objects in the high-dimensional database that are similar to a given query object. Nearest neighbor search is a central requirement in such cases.There is a long stream of research on solving the nearest neighbor search problem, and a large number of multidimensional indexes have been developed for this purpose. However these indexes turn worse with the dimension growth, which is called dimensionality curse.In order to improve the query efficiency, K-means cluster approach is often used to estimate the data distribution in the context of high dimensional metric space index. But in previous work, the parameters of clustering are usually selected according to some heuristic manner. In this paper, we present a new high dimensional index approach - cluster splitting based high dimensional B+-tree. Through cluster splitting, the data space is partitioned more finely to reduce the cost of data access. The relationship between cluster and the query cost is discussed, and based on the query cost model, we present formulas to compute the "optimal" parameters of the cluster which can minimize the query cost in theory. Our experiment results show that the efficiency of our methods is better than iDistance, M-Tree and sequence scan, and the parameters computed by our formulas are very closed to the real optimal one.In order to improve the query answering of high-dimensional database, data distribution is necessary for seclecting appropriate indexing strategy. However, the traditional data distribution model can not estimate the accurate data distribution in the complex real multimedia data of image and video. This paper presents a method to estimate the accurate data distribution based on query sampling and proposes a novel hybrid index to speed up processing of high-dimensional K-nearest neighbor(KNN) queries. The proposed hybrid index improves the query efficiency by adaptively selection different index strategies for the data with different distribution. In the first step the cluster analysis and cluster splitting methods are applied to construct a tree-based index, then the relationship between data distribution and index performance is derived by sampling. At last some tree branches with sparse data are extracted for linear scan, while the aggregate data remains in the tree. Extensive experiments on five real image data sets show that the proposed hybrid index structure performs better than iDistance, M-Tree and linear scan, and scales better with dimensions. The index is still faster than linear scan when the dimension reaches 336. The experiments also show that the proposed query sampling algorithm can obtain the accurate data distribution when the amount of sampling is below N1/2 (N is the size of data set).In general,index can not support the relevance feedback in the image retrieval process.In this dissertation,a novel algorithm supporting relevance feedback using the proposed high-dimensional index.
Keywords/Search Tags:Nearest neighbor query, high dimensional index, sampling, cluster splitting
PDF Full Text Request
Related items