Font Size: a A A

Research Of High Dimensional Retrival Algorithms Based On Adaptive Cluster Distance Bounding

Posted on:2013-01-09Degree:MasterType:Thesis
Country:ChinaCandidate:J GuoFull Text:PDF
GTID:2248330395473355Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the fast development of the high dimensional databases, the size of the high dimensional database is becoming increasing. Now, reasearchers have proposed a lot of high dimensional index structures in order to accelerate the query efficiency. But most of these structures have two defects as follows:first, most of them are based on an unreasonable assumption that data are uniformly distributed and have no connections among dimensions. Second, the query efficiency based on these structures are becoming lower with the increase of the dimension size and a research has improved that the "curse of dimensionality" occurs when the dimensionaliry is more than10. A popular and effective technique to overcome the curse of dimensionality is the vector approximation file, but the acceleration performance of it is limited in a small range, moreover, this structure is also based on the assumption that the data are uniformaly distributed, with independent attributes. All these cannot satisfy the practical use.The adptive cluster distance bounding (ACDB) method for high-dimensional indexing is based on the assumption that the data are non-uniformly distributed and have some connections among dimensions for real datasets. Comparied with other high dimensional index structures, the method can solve practical problems better. Moreover, the indexing method is using the Voronoid clustering, which allows for more general convex polygon structures that are able to efficiently bound the cluster surface. In2011, some researchers proposed the KNN algorithm based on ACDB. This algorithm has reduced the I/O times by filtering out irrelevant clusters though a lower value of the query-cluster distance. Reasearchers have proved that this algorithm based on ACDB is better at reducing the I/O numbers and the query performance than the sequencial query, such as VA-File, iDistance, LDC and so on. But with the increasement of the dimensionality and data size, the query efficiency is becoming lower and lower, so the KNN algorithm based on ACDB is still needed to be optimized. Hence, our main work and results are as follows: First, we investigated the-state-of-art algorithms about the high-dimensional indexing in the field of ACDB, and found the defects in the KNN algorithm based on ACDB.Second, with extra storage requirements and pretreatment, we proposed an IV-KNN search algorithm by reducing unnecessary distance calculations and CPU costs according to the triangle inequality formula. Experiment results based on real data sets proved that the efficiency of IV-KNN search algorithm is much better than V-KNN.Finally, to analyze the current approximate similarity retrival methods, we propose a RDFV-KNN algorithm by both applying the PCA and LLE dimension reduction technology. From experiments of real datasets, the efficiency and accuracy of RDIV-KNN algorithm has been improved straightforward.
Keywords/Search Tags:high-dimensional indexing structure, adaptive cluster distance bounding, KNNquery, approximate similarity search, dimension reduction
PDF Full Text Request
Related items