Research On The Index Of Massive High Dimensional Data Via Multiple Hash Tables

Posted on:2019-01-24

Degree:Master

Type:Thesis

Country:China

Candidate:C F Yang

Full Text:PDF

GTID:2348330542489044

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

In recent years,with the rapid development of Internet technologies,multimedia data such as texts,images,videos and so on have shown an explosive growth trend.How to search the target data in the massive multimedia data is a hot issue in the field of computer science.In practical applications,multimedia data are generally represented by their feature data,and these feature descriptions are often high-dimensional vectors.At present,traditional indexing techniques such as hierarchical partitioning tree and clustering partitioning tree cannot deal with this kind of massive high-dimensional data well and they are facing the problem of inefficiency.Aiming at the nearest neighbor queries of massive high-dimensional data,a mainstream solution is to map the data into binary codes mainly due to the low storage cost of the binary codes and the fast computation of the Hamming distance.The main research work includes LSH,product quantization,ITQ,K-means hashing and so on.However,the binary representation itself has some problems:First,how to use the binary codes to maintain the spatial neighborhood structure between the original data;secondly,how to use as few bits of binary code as possible to keep the retrieval performance as high as possible;thirdly,how to use binary codes as an index to give efficient indexing and query solutions of massive high-dimensional data when the size of the data is so large that the Hamming distance matching is inefficient.In order to solve the problem of binary representation of massive high-dimensional data,this paper proposes a new index structure and neighbor search algorithm,that is,an index and query algorithm based on multiple hash tables.Firstly,we select the optimal hash bit grouping scheme by measuring the independence between different hash bits.Since the number of combinations between hash bits is of a geometric order,we propose a method of approximating solutions to construct multiple hash tables.Secondly,we construct offline indexes for data points in the original dataset.Thirdly,for a given query point,we look up the search point neighbors in multiple hash tables separately and propose the expansion and optimization method of neighborhood query.Finally,we combine the current popular big data computing framework Spark to discuss the parallel implementation of the algorithm.In order to evaluate the performance of an index and query algorithm based on multiple hash tables,we conduct a large number of numerical experiments on multiple datasets,including public datasets and synthetic datasets,and we compare them with some mainstream hash and index algorithms.Numerical experiments show that the proposed algorithm has some advantages in the retrieval precision,recall and MAP compared with other algorithms.

Keywords/Search Tags:

Large-scale Data, High-dimensional Data, Approximate Nearest Neighbor Search, Index based Multiple Hash Tables, Spark

PDF Full Text Request

Related items

1	Multiple Hash Tables Indexing And Optimization For Approximate Nearest Neighbor Search
2	Hash-based Approximate Nearest Neighbor Search For High-dimensional Data
3	Similarity Search On Large-scale High-dimensional Data
4	Research On K Nearest Neighbor In High Dimension Data
5	Research Of Local Sensitive Hash Index Based On Nearest Neighbor Graph
6	Learning And Indexing Structural Representations Of Large Scale High Dimensional Data
7	Locality Sensitive Hashing Index Based On Neighborhood Collision Counting
8	Unsupervised Binary Code Learning for Approximate Nearest Neighbor Search in Large-scale Datasets
9	Approximate Nearest Neighbor Search For High-Dimensional Based On Nearest Neighbor Graph
10	Quantization Methods For Large Scale Approximate Nearest Neighbor Search On GPUs