Font Size: a A A

Research Of Approximate K-Nearest Neighbors Search Algorithm Based On Locality Sensitive Hashing

Posted on:2015-09-27Degree:MasterType:Thesis
Country:ChinaCandidate:W M QiuFull Text:PDF
GTID:2298330467986840Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
In recent years, with the continuous and fast development of network information retrieval technology, especially many applications witness a quick increase in the amount and dimension of data to be processed. How to efficiently deal with large amounts of high-dimensional data search problem becomes an important research topic. Due to the "curse of dimensionality", many tree-based index structure and its variants index methods haven’t meet the requirements of users, these tree-based methods become slower than the brute-force approach.LSH (Locality Sensitive Hashing) is the most popular and suitable for approximate similarity search in high-dimensional data space. However, a significant drawback of this approach is the requirement for many hash tables in order to promise good search quality, resulting in high space cost and low time efficiency. Moreover, as the amount of data scale is gradually increasing and MapReduce is gradually widely used, traditional centralized methods haven’t the ability to deal with massive data, in order to solve these problems, the following work has been done in this paper:(1) To overcome the problem of high space cost and low search efficiency, we propose a new two-level hybrid schema, called LSRP-tree, which firstly divides the dataset into many subgroups with a RP-tree structure, and then construct LSH hash tables for each subgroup. Based on the two-level schema and fully exploit the feature of hash function collision, we propose two different approximate similarity search algorithms, separately called CCP and CCF, which efficiently perform approximate k-Nearest Neighbors in high dimensional space. Compared with LSB-tree/LSB-forest method, our methods show better performance than baseline methods in terms of search efficiency, search quality and space cost.(2) In order to solve the problem of insufficient ability to deal with massive data, this paper studies the LSH-based k-Nearest Neighbors search algorithm with the popular mapreduce programming model. We propose a novel LSH-based distributed inverted index scheme and design an efficient search algorithm, called H-c2kNN. At last, we implement our methods and conduct many experiments on real and synthetic data set, the results show that our proposed approach gains good performance and high scalability.
Keywords/Search Tags:Locality sensitive hashing, k-Nearest Neighbors, high dimensional data
PDF Full Text Request
Related items