Font Size: a A A

Research On Inverted Index Compression Based On TermID Sequences Sorting Of Identifiers Reassignment

Posted on:2017-04-25Degree:MasterType:Thesis
Country:ChinaCandidate:Z W GuoFull Text:PDF
GTID:2308330482979280Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Due to the rapid development of information technology, the explosive growth of data, it forms an unprecedented mass of text information in human history. Faced with the massive text information, inverted index which is an efficient full text indexing technique can obtain the required information more quickly and accurately. However, the massive text information forms a huge inverted index, which is up to 300% of the original text, so it is very necessary to compress the inverted index.The generation algorithm of inverted index is generally docID distribution, Posting Lists generation and Posting Lists compression. The common identifier assignment algorithms are based on the URL sort of identifier assignment algorithm and intersection-based identifier reassignment algorithm. And the common Posting Lists compression algorithms are Unary Code, Variable Byte Code, Simple-9 and PForDelta etc.In this paper, we propose an algorithm based on TermID sequences sorting of identifiers. Generate positive lists by traversing the created inverted index, and define the collation of the order of termID sequences in the positive lists, according to the rules, the document records in the positive list are sorted to get a new sequence of documents, then according to the new document order allocates a new identifier for a document, finally, re create inverted index.This paper implements the identifier assignment algorithm based on URL sorting(URL), the intersection-based docID reassignment(IBDA), the algorithm based on the TermID sequences sorting of identifiers(SBDRA) and VByte, Simple-9, Simple-16, NewPFD, OptPFD, PForDelta posting lists compression algorithm. By using the Wikipedia website document data sets, we have implemented 18 groups of hybrid crossover experiments. The experimental results demonstrate that for large scale document data sets, the performance of the algorithm based on the TermID sequences sorting of identifiers is better than the identifier assignment algorithm based on URL sorting and the intersection-based identifier reassignment algorithm, and the inverted index generated by the algorithm has a better overall compression effect.
Keywords/Search Tags:Inverted Index, Identifier, Sorting, Reassignment, Compression
PDF Full Text Request
Related items