Font Size: a A A

Research On Similarity Search Of High Dimensional Data Based On Hash Technique

Posted on:2021-01-06Degree:MasterType:Thesis
Country:ChinaCandidate:B FangFull Text:PDF
GTID:2428330611998845Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Ubiquitous network environment is full of a large number of high-dimensional data,such as audio,video,pictures and so on.The traditional linear search and tree search methods can no longer meet the need of fast similarity search for high-dimensional data.The similar search algorithm based on hash technology proposed in recent years can solve the ”dimensional disaster” problem that traditional search algorithms encounter when facing high-dimensional data.Nowadays,the most popular algorithm based on hash technique is the Locality-sensitive hashing(LSH)algorithm based on p-stable distribution.However,the current position-sensitive hash framework based on p-stable distribution has two shortcomings.First,it requires a large number of hash tables to ensure the performance of similar queries,resulting in an excessive index space footprint.Secondly,the lack of its search strategy makes the search efficiency is not high.Based on the analysis of the traditional position-sensitive hash algorithm,this paper studies and improves the existing calculation model and frame based on LSH hash index.This paper proposes a Locality-sensitive hashing algorithm based on hash bucket heap sort(HSLSH).Firstly,this algorithm adopts the strategy of single hash table and avoids the problem that the index space occupies too much memory due to the large number of hash tables in the traditional LSH frame.Secondly,the algorithm adopts the idea of heap sort based on hash bucket and returns the similar bucket where the data object is.Replacing data object with hash bucket not only reduces the data dimension,but also reduces the size of data comparison.In addition,the method of heap sorting is adopted to filter out different buckets to further improve the query speed.In order to further improve the speed of similar search,this paper proposes a twolayer LSH search framework(BCLSH)based on bucket chain model.The BCLSH framework uses the bucket chain model to save the array of similar bucket references for each hash bucket when building the index,so that during the query process,the similar bucket of the bucket where the query object is located can be returned within O(1)time.In order to reduce the time complexity of index construction,this paper adopts a two-layer LSH framework.In this paper,Batch Orthogonal LSH(BOLSH)function is used to construct the first layer hash table,and the data space is divided into several subspaces,which not only reduces the data scale of each subspace,but also enables each subspace to constructits own hash table in complete parallel.Secondly,in order to solve the problem of uneven data distribution,BCLSH supports cross-partition search and further improves the comparison rules of heap sort in HSLSH algorithm.This paper proposes a new comparison algorithmTheoretical analysis and experiments show that the HSLSH algorithm proposed in this paper can further shorten the time of similar search compared with C2 LSH and FBLSH algorithms.Compared with LSH forest and DPF framework,the BCLSH framework proposed in this paper has better query efficiency and faster construction speed of hash index.
Keywords/Search Tags:similarity search, LSH, BOLSH, heap sort, bucket chain
PDF Full Text Request
Related items