Font Size: a A A

Research On The Index Of Massive High Dimensional Data Via Multiple Hash Tables

Posted on:2019-01-24Degree:MasterType:Thesis
Country:ChinaCandidate:C F YangFull Text:PDF
GTID:2348330542489044Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In recent years,with the rapid development of Internet technologies,multimedia data such as texts,images,videos and so on have shown an explosive growth trend.How to search the target data in the massive multimedia data is a hot issue in the field of computer science.In practical applications,multimedia data are generally represented by their feature data,and these feature descriptions are often high-dimensional vectors.At present,traditional indexing techniques such as hierarchical partitioning tree and clustering partitioning tree cannot deal with this kind of massive high-dimensional data well and they are facing the problem of inefficiency.Aiming at the nearest neighbor queries of massive high-dimensional data,a mainstream solution is to map the data into binary codes mainly due to the low storage cost of the binary codes and the fast computation of the Hamming distance.The main research work includes LSH,product quantization,ITQ,K-means hashing and so on.However,the binary representation itself has some problems:First,how to use the binary codes to maintain the spatial neighborhood structure between the original data;secondly,how to use as few bits of binary code as possible to keep the retrieval performance as high as possible;thirdly,how to use binary codes as an index to give efficient indexing and query solutions of massive high-dimensional data when the size of the data is so large that the Hamming distance matching is inefficient.In order to solve the problem of binary representation of massive high-dimensional data,this paper proposes a new index structure and neighbor search algorithm,that is,an index and query algorithm based on multiple hash tables.Firstly,we select the optimal hash bit grouping scheme by measuring the independence between different hash bits.Since the number of combinations between hash bits is of a geometric order,we propose a method of approximating solutions to construct multiple hash tables.Secondly,we construct offline indexes for data points in the original dataset.Thirdly,for a given query point,we look up the search point neighbors in multiple hash tables separately and propose the expansion and optimization method of neighborhood query.Finally,we combine the current popular big data computing framework Spark to discuss the parallel implementation of the algorithm.In order to evaluate the performance of an index and query algorithm based on multiple hash tables,we conduct a large number of numerical experiments on multiple datasets,including public datasets and synthetic datasets,and we compare them with some mainstream hash and index algorithms.Numerical experiments show that the proposed algorithm has some advantages in the retrieval precision,recall and MAP compared with other algorithms.
Keywords/Search Tags:Large-scale Data, High-dimensional Data, Approximate Nearest Neighbor Search, Index based Multiple Hash Tables, Spark
PDF Full Text Request
Related items