Font Size: a A A

Research On Partition-based Inverted Index Compression Algorithm

Posted on:2018-10-13Degree:MasterType:Thesis
Country:ChinaCandidate:J T LiFull Text:PDF
GTID:2348330512976857Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Inverted index is currently a widely used full-text index technology,and it also is the core technology of modern search engine.Now the explosive growth of text data on the Internet leads to an increasing number of storage space for the inverted indices for these text data.Compression of inverted indices not only saves disk space,but also reduces disk and main memory accesses.Therefore,the research on compression of inverted indices is very important for full-text retrieval of mass text data.Compression of inverted indices mainly reduces the compression of docID sequences and frequency sequences of inverted indices,and ultimately reduces to the compression of monotonically increasing non-negative integer sequences.Partitioned Elias-Fano(PEF)indexes use a partitioned two-level data structure to compress the integer sequence by partitioning it into chunks,showing excellent compression performance.Partition-based integer sequences compression is used on compression of inverted indices in further study;the main work is as follows:(1)Based on rigorous theoretical proof,Golomb-Rice coding is better than Elias-Fano coding in compression.For a given monotonically increasing integer sequence of n elements where u is any upper bound on the last value,we put forward a method to directly determine the parameter b of Golomb-Rice coding.(2)Combining the excellent compression performance of Golomb-Rice coding with the idea of partitioning a sequence to some chunks that are encoded independently,Partitioned Elias-Fano-Golomb-Rice(PEFGR)coding algorithm and Partitioned Golomb-Rice(PGR)coding algorithm were proposed.PEFGR coding algorithm partitions the integer sequence into chunks and forms a two-level data structure:the first level is an Elias-Fano representation of boundary elements of each chunk;the second level is a Golomb-Rice representation of all elements of each chunk.Compared with PEFGR coding algorithm,the two levels of PGR coding algorithm both are in the Golomb-Rice representation,this further improve the compression performance.(3)PEFGR coding algorithm and PGR coding algorithm are implemented,at the same time,Golomb coding algorithm,Elias-Fano coding algorithm,PEF coding algorithm,Simple-16 algorithm and OptPFD coding algorithm are also implemented.The data sets used in this experiment are from two corpora:Clueweb12 and wikipediaXML.Experimental results on these datasets verify that Golomb-Rice coding offers better compression than Elias-Fano coding.Experimental results on these datasets show that PGR coding algorithm offers the best excellent compression performance,and PEFGR coding algorithm take second place.PGR coding algorithm saves at least 0.3 bits per integer than PEF coding algorithm on the docID,and saves more bits than Simple-16 coding algorithm and OptPFD coding algorithm.
Keywords/Search Tags:Full-Text Index, Inverted Index, Index Compression, Partition, Integer Sequence Compression
PDF Full Text Request
Related items