
The Binary Encoding Algorithm For Improving The ANN Search Performance

Posted on: 2018-12-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Wang
Full Text: PDF
GTID: 1318330515476118
Subject: Computer application technology
Abstract/Summary:
Approximate nearest neighbor (ANN) search, which returns the data points with the smallest distances to a query sample, has been widely applied in machine learning and computer vision. Owing to the rapid development of Internet technology and the widespread use of multimedia devices, the era of massive multimedia data has arrived. Unlike text, the features of multimedia data are represented as high-dimensional floating-point vectors, so how to answer ANN queries quickly over massive collections of high-dimensional vectors has become a hot issue in both academia and industry. Hashing algorithms map the high-dimensional floating-point vectors into the Hamming space and represent them as compact binary codes; approximate nearest neighbors are then returned according to the similarity of the binary representations. Because the memory footprint and the cost of computing Hamming distances are relatively low, ANN search over massive high-dimensional vectors can be carried out quickly in the Hamming space. However, the weak discriminative power of binary codes and the limitations of some hashing algorithms degrade ANN search performance, and these problems have not been well solved.

This dissertation studies all stages of the ANN search framework and proposes improvements that minimize the difference between the retrieval results obtained in the Hamming space and those obtained in the original space. The ANN search procedure usually consists of an initial retrieval stage and a post-verification stage. The initial results are mainly determined by the quality of the binary codes, which depends on the objective function and on its convergence; during post-verification, whether the search performance can be further improved depends on the distance table. This dissertation therefore focuses on how to quickly obtain high-quality binary codes and an optimal distance table. The research contents and contributions are as follows.

1. A normalized distance similarity preserving hashing method is proposed. It minimizes the bias between the two kinds of normalized distances of each data pair, which keeps the pair's similarity in the Hamming space close to its similarity in the Euclidean space. For this objective to be well defined, the value range of the Hamming distance must coincide with that of the original distance, so both distances are normalized. As a result, the objective function involves no balancing parameter: when the training database or the code length changes, there is no need to compare candidate parameter values and select the one that gives the best search performance. The training complexity is therefore relatively low, and the search performance is more stable. Because the objective is defined over the discrete binary encoding function, it is difficult to optimize directly; the sigmoid function is adopted to relax the binary encoding procedure. For massive databases, a mini-batch stochastic gradient descent algorithm is employed to learn the hashing function, which reduces the training time complexity.

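A minimal sketch of this first idea, under assumptions not spelled out in the abstract: a linear projection hash sign(XW + b), a sigmoid relaxation of the sign function, the Hamming distance normalized by the code length, and the Euclidean distance normalized by its maximum over the mini-batch. The loss form and all names below are illustrative, not the dissertation's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relaxed_codes(X, W, b):
    # Sigmoid relaxation of the discrete encoding sign(XW + b):
    # values lie in (0, 1) instead of {0, 1}, so the loss is differentiable.
    return sigmoid(X @ W + b)

def binary_codes(X, W, b):
    # The actual binary codes used at search time.
    return (X @ W + b > 0).astype(np.uint8)

def normalized_distance_loss(X, W, b):
    """Mean squared bias between normalized Hamming and Euclidean distances.

    Both distances are scaled into [0, 1] so their value ranges coincide
    and no balancing hyper-parameter is needed (illustrative loss).
    """
    k = W.shape[1]                                  # code length in bits
    H = relaxed_codes(X, W, b)                      # n x k relaxed codes
    # Relaxed Hamming distance, normalized by the code length.
    d_ham = np.abs(H[:, None, :] - H[None, :, :]).sum(-1) / k
    # Euclidean distance, normalized by its maximum over the batch.
    d_euc = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_euc = d_euc / (d_euc.max() + 1e-12)
    return np.mean((d_ham - d_euc) ** 2)

# Example: evaluate the loss on a random mini-batch, as one step of
# mini-batch stochastic gradient descent would do.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))                      # mini-batch of features
W = rng.normal(scale=0.1, size=(64, 32))            # 32-bit hash projection
b = np.zeros(32)
print(normalized_distance_loss(X, W, b), binary_codes(X, W, b).shape)
```

Because both distance terms are scaled into [0, 1], their value ranges match by construction, so no weighting parameter has to be re-tuned when the database or the code length changes, which is the property the abstract emphasizes.
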
2. A ranking-consistent hashing method is proposed. The data pairs are first sorted separately according to their Hamming and Euclidean distance values, and the differences between the two ranking orders are then minimized. Under this formulation, the relative similarity among the data pairs, which is what ANN search requires, is preserved. Meanwhile, a quantization error term, which demands that data points sharing the same binary code be near neighbors, makes the learned binary codes adapt to the data distribution. During training, a hierarchical mechanism and a sub-space method are used to reduce the training time complexity. With the learned encoding centers, out-of-sample points can be encoded through a lookup-based mechanism; however, the encoding time complexity is high and does not scale to massive databases. To solve this problem, a framework similar to the two-step mechanism is constructed to supervise the learning of linear hashing functions. The final hashing functions are formed by combining many weak ones with the AdaBoost mechanism, which largely preserves the locality-sensitive information.

3. Many hashing algorithms optimize their objective functions with gradient descent, which usually stops at a local optimum. To alleviate this problem, multiple groups of encoding centers are initialized according to the data distribution and the code length, so that the algorithm converges in reasonable time and the different minima are found. Each data point is encoded as multiple binary codes, and the initial retrieval results are returned according to the average Hamming distance. In the Hamming space, many data points share the same distance to the query even though their similarities to the query differ in the original space. To avoid this, a distribution-adaptive distance table is established. Unlike the commonly used distance table, which takes the mean of the data sharing a binary code as the center, the entries of the proposed table are the distances between different cluster centers, which enhances the discriminative ability. Because the binary codes have finite length, many pairs of cluster centers share the same binary index, and their distance values are stored in the same cell of the table. During the post-verification stage, the distance of a data pair is obtained by intersecting the distance sets indexed by the pair's multiple binary codes, and the initial retrieval results are re-sorted according to the obtained distances, which further improves the ANN search performance.

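The distribution-adaptive distance table and the re-ranking step can be illustrated with the following sketch. It assumes that each point (and the query) carries several integer binary indexes, that each table cell keyed by a pair of indexes stores the set of center-to-center distances colliding on that pair, and that the intersection of the retrieved sets (with the union as a fallback) yields the distance estimate; the data layout and all function names are assumptions for illustration, not the dissertation's exact design.

```python
import numpy as np
from collections import defaultdict

def build_distance_table(centers, codes):
    """Distance table keyed by pairs of binary indexes.

    `centers` is an (m, d) array of cluster centers and `codes` an (m,)
    array of their integer binary indexes. Because the codes are short,
    several center pairs may collide on the same index pair, so each
    cell holds a set of center-to-center distances.
    """
    table = defaultdict(set)
    m = len(centers)
    for i in range(m):
        for j in range(m):
            d = float(np.linalg.norm(centers[i] - centers[j]))
            table[(int(codes[i]), int(codes[j]))].add(round(d, 6))
    return table

def estimate_distance(table, query_codes, point_codes):
    """Estimate a query/point distance from their multiple binary indexes.

    Each (query code, point code) pair selects one distance set; the
    intersection narrows the possible center-to-center distances, and the
    mean of the intersection (or of the union as a fallback) is returned.
    """
    sets = [table[(q, p)] for q in query_codes for p in point_codes]
    sets = [s for s in sets if s]
    if not sets:
        return float("inf")
    common = set.intersection(*sets)
    candidates = common if common else set.union(*sets)
    return sum(candidates) / len(candidates)

def rerank(table, query_codes, candidates):
    """Re-sort the initial Hamming-space results by the estimated distances."""
    return sorted(candidates,
                  key=lambda codes: estimate_distance(table, query_codes, codes))

# Example: four cluster centers, 2-bit indexes with one collision, then re-ranking.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
codes = np.array([0, 1, 1, 2])
table = build_distance_table(centers, codes)
print(rerank(table, query_codes=(0,), candidates=[(2,), (1,)]))
```
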
Keywords/Search Tags: Approximate nearest neighbors search, Hashing algorithm, Binary codes, Post-verification, Distance table