| In the era of digital economy,new-generation technologies such as big data,cloud computing,and mobile Internet have spawned massive amounts of unstructured data such as images,videos,and texts.In order to retrieve these unstructured data,deep learning techniques are usually used to convert them into structured vectors,and then perform vector retrieval.Now,the method based on the nearest neighbor graph has become the mainstream algorithm of vector retrieval because of its excellent retrieval ability.However,the current traditional neighbor graph-based methods rely too much on memory,and there is a high memory cost in large-scale data.To solve this problem,current mainstream methods,such as Disk ANN,store the memory-intensive neighbor graph index on a solid-state disk(SSD),which significantly reduces the memory consumption during retrieval.However,the Disk ANN algorithm does not consider the high-concurrency batch query retrieval scenarios common in practical applications,resulting in a large number of redundant SSD access requests,which increases the retrieval delay.This thesis aims to optimize the retrieval algorithm of Disk ANN to reduce the IO time of interacting with SSD in the batch query scenario.The specific research contents are as follows:(1)Batch query rearrangement algorithm based on routing phase characteristics.The routing process of the neighbor graph can be divided into two stages,the first stage is the remote route approximation and the second stage is the approximate brute force search.And in the second stage of the search,the query scope is basically located in the largest strongly connected component(SCC)in the graph.In response to this feature,this thesis adopts the following two strategies to reduce the number of IOs: 1.A batch query clustering algorithm based on SCC similarity is proposed,which improves the reusability of accessing SSD data in the second stage and reduces the number of IOs;2.This thesis optimizes the entry point of the query through the caching strategy to further reduce the number of uncertain random IO reads in the routing process of the first stage.(2)A multi-queue dynamic scheduling strategy aimed at the "long tail effect".In the batch query scenario,the Disk ANN algorithm has the problem of increasing the average IO time caused by the "long tail effect".Specifically,in order to achieve a sufficiently high search accuracy,a small number of queries in the neighbor graph batch query need to take a long IO processing time.Under the current single-queue read strategy,such queries will increase the waiting time of subsequent queries,thereby affecting the average IO time.In response to the above problems,this thesis proposes a multi-queue dynamic scheduling strategy,which relies on two key modifications: 1.By using multiple queues instead of a single queue,the impact of long-tail IO on the overall IO time is reduced;2.this thesis based on The principle of the shortest queue length schedules and allocates queries,reducing the average IO time.(3)Search early termination mechanism based on regional characteristics.In the existing methods,the query termination strategy is fixed and cannot be terminated adaptively according to the query distribution,which is easily affected by the "long tail effect" and leads to redundant calculations.To solve this problem,this thesis proposes a search termination strategy based on region features,which predicts the number of remaining search steps through the region features in the query process,so as to realize dynamic search termination and reduce redundant search overhead.Based on the above achievements,this thesis designs and implements a picture vector retrieval system,and completes the actual retrieval application of hundreds of millions of images,which verifies the retrieval efficiency and effectiveness of the algorithm in this thesis in large-scale high-concurrency vector retrieval application scenarios. |