Font Size: a A A

Hash-based Approximate Nearest Neighbor Search For High-dimensional Data

Posted on:2018-06-15Degree:MasterType:Thesis
Country:ChinaCandidate:X Y ZhangFull Text:PDF
GTID:2348330521950953Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
The developing of big data applications has brought new challenges and opportunities to database management.Especially,retrieving for high dimensional data as a crucial problem is urgent to be solved.Recent years,approximate nearest neighbor(ANN)search methods based on hashing technique have caused wide attention.Hashing has been acknowledged as one of key technique to solve the large scale of high dimensional data problem in practical application,due to its high efficiency of storage and rapid retrieval.Hash-based methods solving the approximate nearest neighbors searching problem can be divided into two main categories:firstly,locality sensitive hashing(LSH),which maps a high-dimensional feature vector into a hash code by several random projections,and designs hashing functions without exploring the data distribution;secondly,data-dependent hashing,which learns hash functions according to data distribution,and generates more compact distinguishing and distance-preserving hash codes.The main research efforts along hash-based methods consist of(1)designing efficient hash functions and generating more distinguishing and distance-preserving hash codes;(2)constructing an effective index structure based on hash codes,and implementing an efficient searching algorithm.This paper firstly focuses on the approximate nearest neighbor search algorithms using LSH function based on-stable distribution,such as LSB,C2LSH,SK-LSH and SRS.Among the four methods,LSB,SK-LSH and SRS respectively re-organize the hash codes using different strategies,and adapt the new hash keys into the efficient index structures for low-dimensional data(like B-tree,B~+tree,R-tree and so on),to reduce the cost of I/O and time during the searching progress.However,this kind of strategies for re-organizing hash codes must fix the order of hash functions,which may result that the searching algorithm reject some neighbors as the candidates.C2LSH adopts the flexible strategy of dynamic collision counting to collect candidates,which allows the neighbors more likely to stay in the candidate set.But there are few index structures in external memory for dynamic collision counting.Due to low efficiency of dynamic collision counting in external memory,we propose to change the original LSH function based on-stable distribution into binary LSH.The superiority of binary LSH in the accuracy for the approximate nearest neighbor search is further explored from both theoretical computation and experimental results.Then,we can convert the dynamic collision counting problem into hamming searching problem,and establish a new index structure,which solves the approximate nearest neighbor search problem with high accuracy and requires only a limited number of random I/O.To improve the accuracy of the searching algorithm further,this paper induces a more efficient data-dependent hash-based algorithm,neighbor-sensitive hashing(NSH),which can distinguish the near neighbors of the query point more accurately.The idea behind NSH abandons the widely-adopted principle of capturing distance properties of all pair-wise objects,aims to generate more effective hash functions for approximate nearest neighbors search tasks.Then,combine the neighbor-sensitive hashing with multi-indexing hashing,and implement fast search algorithm.The experimental results in this paper have proved the superiority of Binary LSH in the searching performance,and also that the hash codes generated by NSH can improve the accuracy of ANN search.
Keywords/Search Tags:Locality Sensitive Hashing, Approximate Nearest Neighbor Search, Index Structure for High-Dimensional Data, Neighbor-Sensitive Hashing, Multi-Indexing Hashing
PDF Full Text Request
Related items