Research On Inverted Index Compression Based On TermID Sequences Sorting Of Identifiers Reassignment

Posted on:2017-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:Z W Guo

Full Text:PDF

GTID:2308330482979280

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

Due to the rapid development of information technology, the explosive growth of data, it forms an unprecedented mass of text information in human history. Faced with the massive text information, inverted index which is an efficient full text indexing technique can obtain the required information more quickly and accurately. However, the massive text information forms a huge inverted index, which is up to 300% of the original text, so it is very necessary to compress the inverted index.The generation algorithm of inverted index is generally docID distribution, Posting Lists generation and Posting Lists compression. The common identifier assignment algorithms are based on the URL sort of identifier assignment algorithm and intersection-based identifier reassignment algorithm. And the common Posting Lists compression algorithms are Unary Code, Variable Byte Code, Simple-9 and PForDelta etc.In this paper, we propose an algorithm based on TermID sequences sorting of identifiers. Generate positive lists by traversing the created inverted index, and define the collation of the order of termID sequences in the positive lists, according to the rules, the document records in the positive list are sorted to get a new sequence of documents, then according to the new document order allocates a new identifier for a document, finally, re create inverted index.This paper implements the identifier assignment algorithm based on URL sorting(URL), the intersection-based docID reassignment(IBDA), the algorithm based on the TermID sequences sorting of identifiers(SBDRA) and VByte, Simple-9, Simple-16, NewPFD, OptPFD, PForDelta posting lists compression algorithm. By using the Wikipedia website document data sets, we have implemented 18 groups of hybrid crossover experiments. The experimental results demonstrate that for large scale document data sets, the performance of the algorithm based on the TermID sequences sorting of identifiers is better than the identifier assignment algorithm based on URL sorting and the intersection-based identifier reassignment algorithm, and the inverted index generated by the algorithm has a better overall compression effect.

Keywords/Search Tags:

Inverted Index, Identifier, Sorting, Reassignment, Compression

PDF Full Text Request

Related items

1	Research On Key Technologies Of Full-text Index Compression In Cloud Environment
2	Research On Partition-based Inverted Index Compression Algorithm
3	A Study On Compression Algorithm Performance Based Inverted Index
4	Space- And Time-efficient Compression And Intersection Algorithms For Inverted Index
5	Research On Inverted Index Compression Algorithms Based On Pattern Coding
6	Inverted Index Compression Algorithm Research
7	The Design And Realization Of Image Set Compression Method Based On Inverted Index
8	Index Compression And Query Processing In Search Engines
9	A Research Of Full-Text Retrieval Based On Inverted Index
10	Some Research On On-Line Index For Dynamic Text