Font Size: a A A

Research On Inverted Index Compression Algorithms Based On Pattern Coding

Posted on:2020-12-07Degree:MasterType:Thesis
Country:ChinaCandidate:Z X AnFull Text:PDF
GTID:2428330575998445Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Inverted index is an important part of information retrieval system,which is used to maintain billions of documents and respond to a large number of query operations.With the increasing amount of Internet data,the storage space required for inverted index is also increasing.Inverted index compression algorithm can improve the performance of information retrieval system,and reduce the storage space of index,and speed up query processing.So it has become an important research object.Compared with the traditional bitwise coding,the pattern coding algorithm has the advantages of fast decoding speed and good compression effect.Therefore,it is widely used in inverted index compression.In this paper,byte-aligned coding,fixed-bit coding and word-aligned coding in pattern coding are studied in depth.The main work is as follows:(1)In this paper,the characteristics of byte-aligned coding and fixed-bit coding are analyzed.On this basis,PVU coding compression algorithm is proposed.The algorithm is based on byte-aligned coding and introduces the idea of partitioning in fixed-bit coding.The two-tier structure of byte alignment coding is improved by the three-tier storage structure of "pattern-length-data".Instead of using bytes as the smallest storage unit,the algorithm designs a variety of smallest storage units for each partition to select the optimal compression mode.Thus,the global compression rate is improved.The partitioning strategy of PVU coding is studied,and the partitioning problem is transformed into the shortest path problem in graph theory.This paper designs and implements a dynamic programming method to solve the optimal partition of coding,and proposes the OptPVU coding of partition optimization.A dynamic programming method is designed and implemented to solve the coding optimal partition.On this basis,OptPVU coding with partition optimization is proposed.(2)The distribution characteristics of DocID sequence after pretreatment were analyzed.Based on Simple Family and improved by run-length coding,Simple21 coding compression algorithm is proposed.The algorithm consists of 21 coding modes.When the sequence contains a large number of consecutive zeros,Simple21 can effectively reduce the storage space compared with other Simple Family coding.Simple21 also enhances the maximum storage length by dividing the pattern identifier and the compressed encoding,and enhances the available range of the algorithm.(3)This paper presents and implements three inverted index compression algorithms:PVU coding,OptPVU coding and Simple21 coding.Compared with Golomb coding,Elias-Delta coding,Variable Byte coding,Stream VByte coding,NewPFD coding and Simple9 coding,experiments are carried out.The experimental results show that Simple21 is superior to other compression algorithms in terms of compression rate and decoding speed,and is the best encoding scheme in the experiment.Compared with byte-aligned coding VByte and Stream VByte,PVU and OptPVU have obvious advantages in compression rate.Compared with fixed-bit coding NewPFD,PVU has similar compression effect with NewPFD,and OptPVU achieves better compression rate and decoding speed than NewPFD.
Keywords/Search Tags:Inverted Index, Index Compression, Pattern Coding, Partition Optimization, Simple Family
PDF Full Text Request
Related items