Font Size: a A A

Locality Sensitive Hashing Index Based On Neighborhood Collision Counting

Posted on:2019-03-21Degree:MasterType:Thesis
Country:ChinaCandidate:L WangFull Text:PDF
GTID:2428330572459003Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
Data is not only an expression of information,but also a carrier of potential information.Low-dimensional data sets are easily queried by traditional database methods.In recent years,digitalization and networking have brought about a boom in data,and the proportion of image and video data in the entire big data has approached 90%.Unstructured data sets have resulted in a huge scale of eigenvectors and feature vector sets.Dimensional data brings many challenges and opportunities to database management.Among them,the indexing and querying of massive high-dimensional data is a very important issue that needs to be solved.Therefore,based on the high-dimensional Nearest Neighbor query count has been widely concerned and continuous optimization.One of its advantages lies in the highefficiency storage method and the high-rate index structure,and on the other hand it guarantees good accuracy in many situations.In practical application,it has become one of the basic technologies to solve the research of massive high-dimensional data retrieval.The methods to solve the massive high-dimensional data query problems can be divided into two categories: First,Locality Sensitive Hashing(LSH)and its variants,mapping highdimensional features into hash codes through random mapping or projection.Generate hash functions.In this way,the probability that two adjacent points of the original data are adjacent in the new space is still large,and the probability that the adjacent points are mapped into the same space is low.The methods to solve the massive high-dimensional data query problems can be divided into two categories: First,Locality Sensitive Hashing(LSH)and its variants,mapping high-dimensional features into hash codes through random mapping or projection.Generate hash functions.In this way,the probability that two adjacent points of the original data are adjacent in the new space is still large,and the probability that the nonadjacent points are mapped into the same space is low.Second,the Nearest Neighbor Graph(NNG)randomly selects one point,makes full use of the neighborhood relationship between data points,compares the candidate points with their neighbors by constructing a graph structure to provide a convergence path for neighbors,and iteratively updates the candidate points.This article first focuses on the local sensitive hash counts and neighbor graphs and analyzes them.In a local sensitive hash based on p-stable distribution,different hashing methods use different strategies to reorganize the hashing code,but these hashing reorganization strategies make the hashing function static in order,which may cause the algorithm to reject certain neighbors as candidates.Collision Counting Locality-Sensitive Hashing(C2LSH)uses a very flexible dynamic counting strategy to select candidate points,making it easier for neighbors to be selected as candidate points.It is worth noting that in C2 LSH,the number of hash functions chosen is not affected by the dimensionality of the data object,which makes C2 LSH particularly suitable for high-dimensional NN searches.The efficiency gain is obtained by approximating the nearest neighbor with a small amount of precision.In order to further optimize the efficiency and accuracy of local sensitive hashing of collision counts,this paper proposes to update candidate point sets in two different ways: collision counts and neighbor graphs,and neighbors may also be neighbors that may be a good neighbor.Domain voting,recounting and optimizing candidate point sets.Achieving approximate nearest neighbor query with a high-precision but fewer preferred candidate set points allows hash coding to have better discrimination of data objects that are closer to the query.The neighborhood collision hash is combined with the nearest neighbor graph to achieve a fast approximate nearest neighbor query of high dimensional data with a high recall rate and accuracy.This article through a large number of experiments on different high-dimensional data sets,the experimental results show that the local sensitive hash index based on the neighborhood collision count and the current popular LSH method has certain advantages in the performance,but also proved that the coding generated by the local sensitive hash can further improve the accuracy and art.
Keywords/Search Tags:Locality Sensitive Hashing, Approximate Nearest Neighbor Search, Index Structure for High-Dimensional Data, Neighbor voting, Nearest Neighbor Graph
PDF Full Text Request
Related items