
Research On Hash Learning Based Approximate Nearest Neighbor Search Method

Posted on: 2020-02-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Cao    Full Text: PDF
GTID: 1360330578471744    Subject: Computer application technology

Abstract/Summary:
With the rapid development of machine learning, realizing high-efficiency large-scale approximate nearest neighbor search has become a hot research problem, and hash learning is one of the mainstream techniques for solving it. However, the hash learning process still involves many key problems and technical challenges. This thesis focuses on three of them: high-effectiveness ranking, high-efficiency ranking, and distributed encoding. Correspondingly, it proposes a weighted-asymmetric-distance-based effective hash ranking method, an efficient hash ranking method for weighted Hamming space, and a distributed encoding method for large-scale hash learning. The detailed research content and innovations are summarized as follows.

Hash learning methods achieve high efficiency in the large-scale approximate nearest neighbor search problem. Nevertheless, how to increase their accuracy attracts the most attention. Most research concentrates on preserving similarities between the original space and the projected space during hash coding; few studies consider the hash ranking stage. Although computing the Hamming distance between two binary codes is very efficient, Hamming-distance-based hash ranking achieves low accuracy. Asymmetric-distance-based hash ranking was therefore proposed: it divides the distance space more finely and increases search accuracy, but it does not take the discrimination among different projected bits into account.

This thesis proposes two weighted-asymmetric-distance-based high-effectiveness hash ranking methods. They differ mainly in how the representative points and the weights are computed. In the former, the representative points are based on the Otsu threshold, and the weights depend on the distribution of the data points and the distance between the query point and the thresholds. In the latter, the representative points are based on the mean value and a probability-based score, and the weights depend on the independence and balance of the hash functions. On the whole, the former suits feature datasets and the latter suits image datasets. Experiments show that the two proposed hash ranking methods outperform both Hamming-distance ranking (by 22%) and asymmetric-distance ranking (by 13%), and they also compare the accuracy of the two methods on different datasets.
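As a hedged illustration of the general idea behind weighted asymmetric ranking, and not the exact formulation used in the thesis (whose Otsu-threshold and probability-based variants are only summarized above), the sketch below takes the per-bit representative points to be the means of the projected values that quantize to 0 and to 1, keeps the query real-valued, and scores binary database codes through a query-specific, per-bit weighted lookup table. The weight definition is a placeholder.

```python
import numpy as np

def fit_representatives(projections):
    # Per-bit representative points: means of the projected values that
    # quantize to 0 and to 1 (a common asymmetric-distance choice; the
    # thesis's Otsu-threshold / probability-based variants differ).
    below = np.where(projections < 0, projections, np.nan)
    above = np.where(projections >= 0, projections, np.nan)
    c0 = np.nanmean(below, axis=0)           # shape: (n_bits,)
    c1 = np.nanmean(above, axis=0)
    return c0, c1

def weighted_asymmetric_rank(query_proj, codes, c0, c1, weights, topk=10):
    # Rank binary database codes by a per-bit weighted asymmetric distance
    # between the real-valued query projection and the representatives.
    d0 = weights * (query_proj - c0) ** 2     # per-bit cost if a stored bit is 0
    d1 = weights * (query_proj - c1) ** 2     # per-bit cost if a stored bit is 1
    dists = codes @ d1 + (1.0 - codes) @ d0   # codes: (n_items, n_bits) of 0/1
    return np.argsort(dists)[:topk]

# Toy usage with random projections standing in for learned hash functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))               # database projections
q = rng.normal(size=64)                       # query projection
codes = (X >= 0).astype(np.float64)           # binary database codes
c0, c1 = fit_representatives(X)
weights = np.ones(64)                         # placeholder per-bit weights
print(weighted_asymmetric_rank(q, codes, c0, c1, weights, topk=5))
```

Because the database side stays binary while only the query side is compared against real-valued representatives, the lookup table is built once per query and each database item is scored with a handful of additions, which is what keeps asymmetric ranking cheap.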
To increase the accuracy of hash learning methods, a number of weighted-Hamming-distance-based hash ranking methods have been proposed in recent years. However, in large-scale approximate nearest neighbor search applications, a linear scan in weighted Hamming space is still time consuming, and no efficient search method other than linear scan exists for weighted Hamming space. Although a multiple-tables data structure has been proposed to realize efficient search in Hamming space, it is difficult to apply it to weighted Hamming space directly, because it is hard to locate the hash buckets in a subtable whose weighted Hamming distances are less than or equal to the subtable search radius.

This thesis proposes a multiple-tables-based efficient search method for weighted Hamming space. The main idea is as follows: by creating two weight-based vectors and comparing their values with the search radius of each subtable, the corresponding hash buckets can be located efficiently and exactly. Meanwhile, creating a candidate set and a cache set to avoid redundant computations of weighted Hamming distances for the data points further increases search efficiency. Experimental results show that the proposed search method achieves up to dozens of times speedup over linear scan on a dataset containing one million data points.

In some large-scale approximate nearest neighbor search applications, data is collected and stored in a distributed manner, so distributed hashing methods are necessary. To apply a single-machine hashing method in a distributed environment, the data points located across the distributed nodes must be transmitted to a single center node, which has two main weaknesses: high communication cost and high storage complexity. Hence, a few distributed hashing methods have been proposed, whose main idea is to realize an existing single-machine hashing method in a distributed manner. These methods have two main disadvantages. First, because they rely on a particular single-machine hashing method, they can at best match its accuracy. Second, they cannot be extended to other, newer single-machine hashing methods.

This thesis proposes two distributed hash learning models: a general distributed hash learning model and a scalable distributed hash learning model. The main difference between them is the data transmitted among the distributed nodes: the former transmits candidate sets, while the latter transmits parameters. On the whole, the former achieves higher accuracy and the latter offers better scalability. Experimental results show that the two proposed distributed hash learning models achieve up to 4% higher accuracy than the single-machine hashing method and up to 150% higher accuracy than state-of-the-art distributed hashing methods.
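The scalable distributed model is described above only at the level of "transmit parameters instead of data points". As a hedged sketch of that idea, and not of the thesis's actual algorithm, the example below has each node send compact sufficient statistics of its local shard (count, feature sum, scatter matrix); the center node reconstructs the global covariance from them and derives PCA-style hash projections, so communication cost depends on the feature dimension rather than on the number of data points. The choice of statistics and of a PCA-style projection are assumptions made for illustration.

```python
import numpy as np

def local_statistics(X_local):
    # Per-node parameters: count, feature sum, and scatter matrix.
    # These are enough to reconstruct the global mean and covariance.
    n = X_local.shape[0]
    s = X_local.sum(axis=0)
    S = X_local.T @ X_local
    return n, s, S

def center_fit_projection(stats, n_bits):
    # Center node: aggregate node statistics and derive PCA-style hash
    # projections from the global covariance (a stand-in for the thesis's
    # scalable distributed model, whose details are not specified here).
    n = sum(st[0] for st in stats)
    s = sum(st[1] for st in stats)
    S = sum(st[2] for st in stats)
    mean = s / n
    cov = S / n - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs[:, ::-1][:, :n_bits]          # top-n_bits principal directions
    return mean, W

def encode(X, mean, W):
    # Binary codes: sign of the centered data projected onto W.
    return ((X - mean) @ W >= 0).astype(np.uint8)

# Toy usage: three "nodes" hold disjoint shards of the data.
rng = np.random.default_rng(1)
shards = [rng.normal(size=(500, 32)) for _ in range(3)]
stats = [local_statistics(X) for X in shards]       # transmitted to the center
mean, W = center_fit_projection(stats, n_bits=16)   # parameters broadcast back
codes = [encode(X, mean, W) for X in shards]        # encoding stays local
print(codes[0].shape)                               # (500, 16)
```

Because counts, sums, and scatter matrices combine additively, the aggregated covariance equals what a single machine would compute on the pooled data, which illustrates why transmitting parameters can avoid the communication and storage cost of centralizing the raw points.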
Keywords/Search Tags:Approximate Nearest Neighbor Search, Hash Learning, Weighted Asymmetric Hamming Distance, Weighted Hamming Distance, Multiple Tables Index, Distributed Hashing