Approximate Nearest Neighbor Search Based On Based On Hashing And Clustering For High Dimensional Data

Posted on:2020-02-24

Degree:Master

Type:Thesis

Country:China

Candidate:S W Deng

Full Text:PDF

GTID:2428330602950192

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

With the rapid development of information technology,information technology such as the Internet,big data,multimedia,and cloud computing have gradually become key areas for national development.The information in various fields has expanded rapidly,the effective management and rapid query of massive high-dimensional data has become an important problem to be solved.Due to the "curse of dimensionality" phenomenon,the traditional indexing method in the high-dimensional case search performance is even worse than linear search.The approximate nearest neighbor search based on the hash method is an effective way to solve the problems.In addition to the rapid query of massive high-dimensional data,hash methods also have deep research significance in many large-scale data fields such as pattern recognition,computer vision,machine learning,and data mining.In the high-dimensional case,the hash method has the advantages of low storage cost and efficient retrieval,which makes the result of approximating the nearest neighbor search equal to the true nearest neighbor in many cases.According to the way the hash function is generated,the hash method is divided into two categories: data-independent hash and data-dependent hash.Locality Sensitive Hash is the mainstream data-independent method.The random projection is used to generate the hash function.The generated process is independent of the data distribution.The high-dimensional data is hashed to generate binary code for fast and efficient retrieval.Data-independent hashing is a hashing method based on the characteristics of data distribution.According to the attribute characteristics of the data in the dataset,a more efficient hash function is designed to make the high-dimensional data points in the original space map through the hash function to obtain more compact hash codes.The research of hash method is mainly divided into two aspects:(1)designing the hash function to obtain a hash code with compact,strong distance-preserving and high discrimination;(2)constructing an efficient index structure and designing a search algorithm.Firstly,this thesis analyzes the problems existing in the traditional indexing method in high-dimensional cases,summarizes the data-independent and data-dependent hashing methods,and analyzes the advantages of data-dependent hashing method in comparison with the data-independent hashing method in the nearest nearest neighbor search.Secondly,this thesis studies the approximate nearest neighbor search algorithm based on data-independent hash such as p-stable distribution.For the problem of locality sensitive hashing in retrieval precision and index structure,this thesis studies the combination of clustering and traditional hashing.By clustering to maintain the similarity between data objects,using a hash method to convert the data object into hash codes and constructing an inverted index,checking the bucket containing the query point and the data in the bucket adjacent to the query point,to achieve fast and efficient approximation nearest neighbor search.Then,because the retrieval precision of the data-independent hash method is limited,we introduce the data-dependent method of Neighbor Sensitive Hashing to further improve the accuracy of the approximate nearest neighbor search.The Neighbor Sensitive Hashing is the opposite of the traditional hash method.It does not choose to preserve the similarity of similar objects in Hamming space,but maximizes the Hamming distance between similar objects to better distinguish the neighbor data objects and improve the accuracy of approximate nearest neighbor search.Finally,for the selection of the central point existing in the Neighbor Sensitive Hashing method,this thesis proposes a Hypersphere Based Clustering Initialization method,which can better describe the entire distribution characteristics of the data set,so that the cluster center can cover as many data objects as possible across the entire dataset,which will be more suitable for the Neighbor Sensitive Hashing method to further improve the accuracy of the approximate nearest neighbor search.The experiments in this thesis prove the advantage of the Neighbor Sensitive Hashing method based on the hypersphere clustering initialization method on search performance compared with other methods.

Keywords/Search Tags:

High-dimensional Data, Approximate Nearest Neighbor Search, Clustering, Neighbor Sensitive Hashing, Hypersphere Based Clustering Initialization

PDF Full Text Request

Related items

1	Hash-based Approximate Nearest Neighbor Search For High-dimensional Data
2	Approximate Nearest Neighbor Search For High-Dimensional Based On Nearest Neighbor Graph
3	Locality Sensitive Hashing Index Based On Neighborhood Collision Counting
4	Research Of Local Sensitive Hash Index Based On Nearest Neighbor Graph
5	Efficient Algorithms For Approximate Aggregation And Nearest Neighbor Queries Over Multi-Dimensional Data
6	Research Of Approximate Nearest Neighbor Search Based On Locality Sensitive Hashing
7	Research On Approximate Nearest Neighbor Search And Maximum Inner Product Search For High-dimensional Dataset
8	Research On Local Sensitive Hashing And Approximate Nearest Neighbor Algorithm
9	Reasearch On Locality Sensitive Hashing Based Approximate Nearest Neighbor(s) Searching Algorithm
10	LSH-based Similarity Search Of Web Data