Font Size: a A A

Multiple Hash Tables Indexing And Optimization For Approximate Nearest Neighbor Search

Posted on:2020-01-31Degree:MasterType:Thesis
Country:ChinaCandidate:J H MiaoFull Text:PDF
GTID:2428330602958019Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Nearest neighbor search is a fundamental problem in various domains,such as computer vision,data mining,and machine learning.With the explosive growth of data on the Internet,the problem of nearest neighbor search of massive high-dimensional data has become a challenge.The traditional nearest neighbor search method based on tree structure is weak in the face of massive high-dimensional data,and the nearest neighbor search method based on hash representation has become one of the most effective methods for solving massive high-dimensional data similarity search problems by virtue of its efficient compression storage and fast retrieval speed.However,the nearest neighbor search methods represented by hash have two problems.The first problem is that data is usually mapped to bit codes by multiple hash functions,so different hash functions generate bits of varying quality,and in most scenarios,these bits are treated equally.Another problem is that when the data size is particularly large,the hamming distance calculation using linear scanning is no longer efficient.In this case,a hash table is usually built to improve the efficiency of the search.However,due to the limitations of long bit codes,the traditional single hash table index method has been difficult to meet the efficiency requirements for retrieval.Based on the above problems of the nearest neighbor search method represented by hash,this paper proposes the HMTable algorithm and the MQRank algorithm.HMTable is a multi-hash table index structure with better performance,which is used to solve the inefficient hash table index problem.MQRank is a more fine-grained bit code ranking algorithm that combines the characteristics of multiple tables.Specifically,since a single hash table is difficult to satisfy the index requirement of long bit codes,a multi-hash table is usually constructed by dividing the bit codes.Because some simple partitioning methods can't capture the relationship information between bits,the performance of the constructed multi-hash table is not good enough.In this paper,HMTable divides the bit code by hierarchical iterative method,which makes the group of divided bit code more uniform,thus constructing a higher quality multi-hash table;Since the hamming distance is only a discrete integer value,it is difficult to accurately measure the difference between different data points at the same distance.The main reason is that different bits have the same effect on the distance calculation.Therefore,for the bit consistency problem,this paper proposes MQRank for more fine-grained weighted calculation of hamming distance.The calculation of the MQRank bit weight is performed according to the influence of the neighbors and the distant neighbors on the ability of the hash function partitioning,and the calibration of the weight is executed independently under each sub-table,so that the generated bit weight is more accurate in the nearest neighbor calculation based on the multi-table.In this paper,experimental design and verification are performed on four public datasets MNIST,CIFARIO,SIFT1M and GIST1M,and the experimental evaluation of multi-hash table index and weight optimization is carried out on the bit codes generated by various hash algorithms.A large number of experimental results show that the proposed method is effective and versatile in approximate nearest neighbor search.
Keywords/Search Tags:High-Dimensional Data, Nearest Neighbor Search, Multi-Table Index, Bit Weight
PDF Full Text Request
Related items