Font Size: a A A

The Design And Implementation Of Approximate Nearest Neighbor Distributed Search System

Posted on:2021-01-17Degree:MasterType:Thesis
Country:ChinaCandidate:W HuFull Text:PDF
GTID:2428330647950839Subject:Engineering
Abstract/Summary:PDF Full Text Request
With the expansion of mobile Internet user groups,the data generated by users online every day is extremely huge and users have higher and higher requirements for information consumption patterns.In order to compete for user time,online service providers also need to deliver content that users are interested in to users more accurately.Huge data is hidden behind online search,product recommendation,news reading and other services,which requires the computer to find out the content of interest to users.All this depends on the nearest neighbor search algorithm.Therefore,this paper designs an approximate nearest neighbor distributed search system based on local sensitive hash(LSH)and hierarchical navigable small world graph(HNSW)algorithms,which mainly realizes the rapid construction and efficient retrieval of indexes under very large-scale vectors.Although the introduction of LSH and HNSW algorithms has brought us theoretical support for splitting the index,which can greatly improve the efficiency of index construction and retrieval,it also brings greater complexity to the implementation of the system.All the data in the index built by HNSW is easily lost in memory.In addition,the divided multiple subindexes need to run on different instances in the cluster to ensure that the system will not have a single point of failure.The distributed system architecture is used in the system described in this article to improve the reliability and availability of the search system.The system mainly includes five modules,namely:(1)Cluster information management module: uses dynamic configuration technology to manage cluster static configuration information,collects data node information through heartbeat,uses Raft protocol to complete leader election,and synchronizes data with Follower nodes through broadcasting.(2)Data backup and recovery module: the combination of hot and cold backup is used to save the data in the system and ensure the reliability of the data.(3)Index management module: realize the automatic allocation of sub-indexes when the system starts.Through the heartbeat monitoring index status and redundant deployment technology,to achieve rapid migration in the event of node failure,to ensure high availability of the system.(4)Load balancing module: responsible for optimizing the distribution of data in the subindex,using a local-sensitive hash algorithm to spatially compress vectors,inserting vectors that are closer to the same sub-index,greatly improving the horizontal expansion of the system ability.(5)Data search module: responsible for the system's external data insertion and data search interface,and coordinate the scheduling of the above four modules to complete the function.
Keywords/Search Tags:Locality Sensitive Hashing, HNSW, Distributed System, Load Balance, Redundant Deployment
PDF Full Text Request
Related items