
Approximate Nearest Neighbor Search Algorithms And Their Application

Posted on: 2022-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: F S Liu
GTID: 2518306725493324
Subject: Software engineering
Abstract/Summary:
In the era of big data, massive amounts of data are generated every day, and technologies represented by deep learning are widely used to extract information from complex data. Deep learning expresses complex objects such as text, speech, and images by encoding them into high-dimensional vectors, and measures the similarity between objects through the distance between their vectors. When building application systems such as recommendation systems, it is often necessary to perform similarity search over large-scale, high-dimensional vectors. To solve this problem, a variety of approximate nearest neighbor search (ANNS) algorithms have been proposed. Despite this variety, applying ANNS in practice still presents several difficulties. First, it is hard to choose a suitable ANNS algorithm for a specific application scenario from among the many available. Second, many existing algorithms are unsuitable for scenarios that require extremely high recall. Finally, when performing similarity search over large-scale data, the index built by the algorithm occupies a large amount of memory, so reducing the size of the constructed index becomes a key question. This thesis summarizes these issues as the problem of selecting an ANNS algorithm for general scenarios and the problem of effectively reducing index size under limited memory resources. To address them, this thesis does the following work:

(1) Our analysis shows that HNSW is the preferred algorithm in general scenarios. However, deleting nodes from an HNSW index may cause some search requests to return too few results. By analyzing the graph structure of the HNSW index, we find that this is caused by the lack of a correct node deletion algorithm in HNSW. Based on this observation, we propose a new node deletion algorithm, HNSW Mutual-Remove, which deletes nodes efficiently while avoiding the problem of insufficient search results. In addition, through analysis and experiments on HNSW and linear scan, we show that in scenarios requiring very high recall, GPU-accelerated linear scan can be considered.

(2) We analyze the high memory usage of ANNS algorithms by studying the index structure of HNSW, and argue that it can be reduced by compressing the high-dimensional vectors or by adopting more lightweight ANNS algorithms. Although the IVF-HNSW algorithm combines both ideas and significantly reduces index size, its index construction is slow. We therefore propose a new index construction method, Balanced IVF-HNSW, which greatly accelerates index construction while maintaining good search performance.

(3) We apply the optimized algorithms above to SimSvr, Weixin's distributed large-scale ANNS library, which indexes billions of data items from many applications inside Weixin. Experiments and the deployment of our optimized ANNS algorithms in SimSvr show that the proposed HNSW Mutual-Remove solves the problem that HNSW lacks a suitable node deletion algorithm and makes HNSW more versatile. At the same time, the proposed Balanced IVF-HNSW effectively accelerates the index construction of IVF-HNSW, making IVF-HNSW more practical in scenarios with limited memory resources.
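
The abstract describes representing objects as high-dimensional vectors and measuring similarity by the distance between them. The following is a minimal NumPy sketch of the exact linear-scan k-nearest-neighbor search that ANNS algorithms approximate (and that a GPU implementation would accelerate); the data size, dimensionality, and variable names are illustrative assumptions, not values from the thesis.

import numpy as np

# Illustrative only: 100k database embeddings and one query embedding,
# both 128-dimensional.
rng = np.random.default_rng(0)
database = rng.standard_normal((100_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

# Exact (linear-scan) k-nearest-neighbor search under Euclidean distance.
k = 10
distances = np.linalg.norm(database - query, axis=1)
top_k = np.argsort(distances)[:k]
print(top_k, distances[top_k])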
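For the HNSW discussion in (1), the sketch below shows a typical open-source HNSW workflow using the hnswlib library. Note that hnswlib only marks elements as deleted rather than repairing the graph, which is the kind of deletion limitation the thesis addresses; the HNSW Mutual-Remove algorithm itself is not reproduced here, and all parameter values are illustrative.

import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
data = np.random.default_rng(1).standard_normal((num_elements, dim)).astype(np.float32)

# Build an HNSW index over the data.
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

index.set_ef(64)  # query-time search breadth
labels, dists = index.knn_query(data[:5], k=10)

# hnswlib only *marks* an element as deleted; the graph node and its edges
# remain in place, so repeated deletions can degrade search results.
index.mark_deleted(0)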
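For the memory-reduction discussion in (2), the sketch below uses faiss to combine an HNSW coarse quantizer with product-quantized inverted lists, a construction in the same spirit as IVF-HNSW. It is not the thesis's Balanced IVF-HNSW, and the dataset and parameters are illustrative assumptions.

import faiss
import numpy as np

d, nlist, m = 128, 1024, 16   # dimension, coarse cells, PQ sub-quantizers
rng = np.random.default_rng(2)
xb = rng.standard_normal((200_000, d)).astype(np.float32)
xq = rng.standard_normal((10, d)).astype(np.float32)

# Coarse quantizer: an HNSW graph over the nlist centroids.
quantizer = faiss.IndexHNSWFlat(d, 32)
# Inverted lists storing product-quantized (8-bit) codes to shrink the index.
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

index.train(xb)        # learns the coarse centroids and PQ codebooks
index.add(xb)
index.nprobe = 16      # number of inverted lists scanned per query
D, I = index.search(xq, 10)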
Keywords/Search Tags:Information Retrieval, Approximate Nearest Neighbor Search, Method Selection, Model Compression