Font Size: a A A

Approximate Nearest Neighbor Search Based On Based On Hashing And Clustering For High Dimensional Data

Posted on:2020-02-24Degree:MasterType:Thesis
Country:ChinaCandidate:S W DengFull Text:PDF
GTID:2428330602950192Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid development of information technology,information technology such as the Internet,big data,multimedia,and cloud computing have gradually become key areas for national development.The information in various fields has expanded rapidly,the effective management and rapid query of massive high-dimensional data has become an important problem to be solved.Due to the "curse of dimensionality" phenomenon,the traditional indexing method in the high-dimensional case search performance is even worse than linear search.The approximate nearest neighbor search based on the hash method is an effective way to solve the problems.In addition to the rapid query of massive high-dimensional data,hash methods also have deep research significance in many large-scale data fields such as pattern recognition,computer vision,machine learning,and data mining.In the high-dimensional case,the hash method has the advantages of low storage cost and efficient retrieval,which makes the result of approximating the nearest neighbor search equal to the true nearest neighbor in many cases.According to the way the hash function is generated,the hash method is divided into two categories: data-independent hash and data-dependent hash.Locality Sensitive Hash is the mainstream data-independent method.The random projection is used to generate the hash function.The generated process is independent of the data distribution.The high-dimensional data is hashed to generate binary code for fast and efficient retrieval.Data-independent hashing is a hashing method based on the characteristics of data distribution.According to the attribute characteristics of the data in the dataset,a more efficient hash function is designed to make the high-dimensional data points in the original space map through the hash function to obtain more compact hash codes.The research of hash method is mainly divided into two aspects:(1)designing the hash function to obtain a hash code with compact,strong distance-preserving and high discrimination;(2)constructing an efficient index structure and designing a search algorithm.Firstly,this thesis analyzes the problems existing in the traditional indexing method in high-dimensional cases,summarizes the data-independent and data-dependent hashing methods,and analyzes the advantages of data-dependent hashing method in comparison with the data-independent hashing method in the nearest nearest neighbor search.Secondly,this thesis studies the approximate nearest neighbor search algorithm based on data-independent hash such as p-stable distribution.For the problem of locality sensitive hashing in retrieval precision and index structure,this thesis studies the combination of clustering and traditional hashing.By clustering to maintain the similarity between data objects,using a hash method to convert the data object into hash codes and constructing an inverted index,checking the bucket containing the query point and the data in the bucket adjacent to the query point,to achieve fast and efficient approximation nearest neighbor search.Then,because the retrieval precision of the data-independent hash method is limited,we introduce the data-dependent method of Neighbor Sensitive Hashing to further improve the accuracy of the approximate nearest neighbor search.The Neighbor Sensitive Hashing is the opposite of the traditional hash method.It does not choose to preserve the similarity of similar objects in Hamming space,but maximizes the Hamming distance between similar objects to better distinguish the neighbor data objects and improve the accuracy of approximate nearest neighbor search.Finally,for the selection of the central point existing in the Neighbor Sensitive Hashing method,this thesis proposes a Hypersphere Based Clustering Initialization method,which can better describe the entire distribution characteristics of the data set,so that the cluster center can cover as many data objects as possible across the entire dataset,which will be more suitable for the Neighbor Sensitive Hashing method to further improve the accuracy of the approximate nearest neighbor search.The experiments in this thesis prove the advantage of the Neighbor Sensitive Hashing method based on the hypersphere clustering initialization method on search performance compared with other methods.
Keywords/Search Tags:High-dimensional Data, Approximate Nearest Neighbor Search, Clustering, Neighbor Sensitive Hashing, Hypersphere Based Clustering Initialization
PDF Full Text Request
Related items