Font Size: a A A

Research And Implementation Of Inverted Index Compression Algorithm For Distributed Systems

Posted on:2023-03-01Degree:MasterType:Thesis
Country:ChinaCandidate:J WanFull Text:PDF
GTID:2558306902951249Subject:Computer technology
Abstract/Summary:PDF Full Text Request
In distributed storage systems,search engines are the main tool for fast retrieving files,and inverted index is the core data structure of a large-scale search engine,which is essentially an ordered set of integers sequences called a posting list.Due to the excessive number of documents indexed by such engines,and the strict performance requirements for query load,there are billions of integers in an inverted index,for which the query process must be retrieved efficiently.Compressing inverted indexes has two advantages:in terms of time,more data can be placed in a faster memory structure,speeding up queries;in terms of space,reducing the number of machines required to store the indexes,reducing storage space.The performance of index compression algorithm needs to consider the metrics of encoding speed,compression ratio and query time of the algorithm.This paper takes Partitioned Elias-Fano index compression algorithm as the research object,and conducts the research based on the above performance indicators.The work of this paper mainly includes the following three aspects:(1)A PEF(Partitioned Elias-Fano)algorithm based on genetic algorithm is proposed,aiming at the problem of too slow PEF index encoding process.The algorithm regards the partitioning problem of the posting list in the PEF algorithm as the shortest path problem of solving the graph,and uses a priority-based approach to encode the partitioning scheme as a chromosome in the genetic algorithm process,after multiple rounds of genetic operations,which can obtain the suboptimal partitioning scheme with a high probability,and the optimal partitioning scheme with a small probability in linear time.The experimental results show that,compared with the original algorithm,the genetic algorithm based PEF improves the index generation speed by 28%,and maintains a similar space-time trades-off similar to the PEF algorithm.(2)A PEF algorithm based on Mean Shift clustering is proposed,aiming at the problem that the PEF algorithm can only encode a single posting list at a time,and fails to reduce redundancy by using similar subsets between lists.The algorithm clusters the posting lists and constructs a ad-hoc reference list for each cluster,all of which are encoded using the Elias-Fano algorithm.At the same time,two algorithms for heuristically constructing reference lists were also designed,but both differ in the way they add posting list elements to the reference list:according to the frequency of their occurrence in the cluster;or according to the number of bits required for encoding.The experiments results show that,on average,0.5 bits can be saved per element compared to the PEF algorithm,without affecting the query time.(3)In order to implement the file resource retrieval function of the distributed system,a search engine is designed and implemented for the system.It mainly consists of six modules,namely user registration module,user login module,index construction module,index compression module,file resource retrieval module and personal information management module.The system has passed the relevant functional tests,the relevant functions have been basically implemented,and the better performance of some inverted index compression algorithms are applied to it.
Keywords/Search Tags:Inverted Index, Genetic Algorithm, Clustering Algorithm, Search Engine, Index Compression
PDF Full Text Request
Related items