Font Size: a A A

Inverted Index Compression Algorithm Research

Posted on:2016-10-25Degree:MasterType:Thesis
Country:ChinaCandidate:F L MaoFull Text:PDF
GTID:2298330467979084Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Inverted indexes are the most fundamental and widely used data structures in in-formation retrieval for the large scale text data.An uncompressed inverted index for a given text collection can be quite large. It is reduced the comsumption of the hard disk space the main memory and the performance of information retrieval is improved to compress to an inverted index. Therefore, in-verted index compression has become an important issue to research.Inverted index compression has two primary areas:lexicon compression and post-ing list compression. In the posting list, the docIDs and frequencies are encoded. Al-ways, those informations in the posting list are numbers.Because an inverted index consists of two principal components, the dictionary and the postings lists, we can study two different types of compression methods:dictionary compression and postings list compression. A posting listing includes docID, frequency and position that are integer.Based on the RFEGC algorithm, RFMEGC algorithm is proposed and implemeted to compress and decompress the docID in an inverted index, RFMEGC algorithm re-duces some calculation steps and space usage of the whole index..RLERFEGC algorithm based on RFEGC is proposed and implemented to encode and decode the docID in an inverted index, we use a run-length encoding to compress these lists (as many consecute1s are generated) we then used RFEGC algorithm to en-code the number. We show that RLERFEGC algorithm reduces the space usage.DPRFEGC algorithm is proposed and implemented to compress and decompress the docID in an inverted index. The existing compression algorithms compress the number sequence that has the fixed length.In DPRFEGC algorithm, dynamic pro-gram-ming is used to partition the number sequence, and use the RFEGC algorithm to com-press and decompress individually its part. DPRFEGC algorithm can reduce the space usage because a different parameter for a variable length number sequence is used to improve the access to data and accelerate the speed of data decompresion.The experimental results show that the proposed RFMEGC algorithm, offers sig-nificantly better compression for docIDs in an inverted index than other compresion al-gorithms. The proposted RLERFEGC algorithm offers better compression for frequency in the inverted index than other compresion algorithm. The proposted DPRFEGC algo- rithm can obtain fast decompression speed.
Keywords/Search Tags:Inverted Index, Dynamic Programming, Run-length Code, Compres-sion and Decompression
PDF Full Text Request
Related items