
A Study on Compression Algorithm Performance Based on Inverted Index

Posted on: 2010-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Pan
Full Text: PDF
GTID: 2178330338475898
Subject: Computer software and theory

Abstract/Summary:
In the era of information explosion, tens of millions of new pieces of information appear on the Internet, and the number of web pages grows rapidly. Locating target information efficiently in such massive data sets has made the search engine one of the most popular technologies and has placed ever higher demands on its performance. A search engine's performance is closely tied to its index structure, and the inverted index is the structure search engines most commonly use. An inverted index takes a term as the starting point of a search and traces it back to the many information sources that contain that term. The index keeps one inverted list per term, recording the identifiers of the documents containing the term together with the term's frequency and positions in each document: term → (f_{d,t}, d_i, [p_0, …, p_{freq-1}]). Our research on index compression algorithms is based on this inverted index structure.

Applying index compression not only reduces storage space but can also improve search performance. A compressed index requires less storage, and compressed data makes better use of the available communication bandwidth: more information can be transferred per second than when the data is uncompressed, so the total cost of transferring compressed data and then decompressing it can be far lower than the cost of transferring the data uncompressed. To improve search performance, search engines usually cache inverted lists in memory; for the same cache size, more compressed data can be held, which raises the cache hit ratio.

In this paper we compare three compression schemes, Variable-Byte, Simple9, and PForDelta, on the open-source information retrieval system Lucene. The paper focuses on the compression and storage of document IDs, frequencies, and position information in Lucene's word-level inverted index. Several new compression algorithms have been proposed in recent years, but no prior work has compared these state-of-the-art techniques on a popular, open-source, Java-based IR system. Our main work includes: 1) improving the implementation of PForDelta, which has the fastest decompression speed, and demonstrating the improvement experimentally; 2) discussing how if-then-else constructs in the decompression code affect performance in a Java environment; 3) investigating whether interleaved and non-interleaved index files differ noticeably in compression ratio and decompression speed; 4) comprehensively comparing the algorithms' compression ratios and their term-search and phrase-search performance on document sets of different sizes.

The experimental results show that Variable-Byte has the most stable performance across document sets of different sizes and performs very well in skipped phrase search. Simple9 achieves the best compression ratio, but because the JVM weakens the performance damage caused by branch prediction, its term-search performance is weaker than that of the other two methods. By removing the if-then-else constructs from the critical decompression code, PForDelta achieves the fastest term-search speed. When the value distribution is kept non-interleaved, the term search of Simple9 and PForDelta improves by 5%-8%. Because of the skipped phrase-search mechanism, Simple9 and PForDelta, which decompress integers in batches, do not perform well in phrase search; as the inverted lists grow longer, however, PForDelta's phrase-search performance gradually improves.
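As a concrete picture of the posting-list compression discussed above, the following is a minimal sketch of d-gap encoding of document IDs followed by Variable-Byte coding, one of the three schemes compared in this thesis. The class and method names and the stop-bit convention are our own assumptions for illustration; this is not code from Lucene or from the thesis itself.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch: d-gap plus Variable-Byte coding of a sorted posting list. */
public class VByteSketch {

    /** Encode sorted document IDs as gaps, each gap as a variable-length byte sequence. */
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous;          // small gaps need few bytes
            previous = docId;
            while (gap >= 128) {                 // 7 payload bits per byte,
                out.write(gap & 0x7F);           // high bit 0 means "more bytes follow"
                gap >>>= 7;
            }
            out.write(gap | 0x80);               // high bit 1 marks the last byte of a gap
        }
        return out.toByteArray();
    }

    /** Decode the byte stream back into absolute document IDs. */
    static List<Integer> decode(byte[] bytes) {
        List<Integer> docIds = new ArrayList<>();
        int docId = 0, gap = 0, shift = 0;
        for (byte b : bytes) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {               // stop bit set: the gap is complete
                docId += gap;
                docIds.add(docId);
                gap = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] postings = {3, 7, 21, 150, 152, 1000};
        byte[] compressed = encode(postings);
        System.out.println(compressed.length + " bytes for " + postings.length + " doc IDs");
        System.out.println(decode(compressed)); // [3, 7, 21, 150, 152, 1000]
    }
}
```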
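The remark about removing if-then-else constructs from the critical decompression code can also be pictured with a small sketch. PForDelta-style decoders typically keep one fully unrolled shift-and-mask routine per bit width and dispatch to it once per block, so the inner loop itself contains no branches. The helper below, which unpacks eight 4-bit values from one 32-bit word, is a hypothetical illustration of that idea, not the thesis' or Lucene's actual implementation.

```java
import java.util.Arrays;

/**
 * Sketch of a branch-free unpacker: eight 4-bit values packed into one int.
 * A PForDelta-style decoder would hold one such unrolled routine per bit width,
 * chosen once per block, so decompression runs as straight-line code.
 * (Hypothetical helper for illustration only.)
 */
public final class Unpack4 {

    /** Write the eight 4-bit fields of {@code word} into {@code out} starting at {@code pos}. */
    static void unpack(int word, int[] out, int pos) {
        out[pos]     =  word         & 0xF;
        out[pos + 1] = (word >>> 4)  & 0xF;
        out[pos + 2] = (word >>> 8)  & 0xF;
        out[pos + 3] = (word >>> 12) & 0xF;
        out[pos + 4] = (word >>> 16) & 0xF;
        out[pos + 5] = (word >>> 20) & 0xF;
        out[pos + 6] = (word >>> 24) & 0xF;
        out[pos + 7] = (word >>> 28) & 0xF;
    }

    public static void main(String[] args) {
        // Pack the values 1..8 into the eight 4-bit slots of one word, then unpack them.
        int word = 0;
        for (int i = 0; i < 8; i++) {
            word |= (i + 1) << (4 * i);
        }
        int[] out = new int[8];
        unpack(word, out, 0);
        System.out.println(Arrays.toString(out)); // [1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```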
Keywords/Search Tags: Inverted index, index compression, index compression algorithm, performance evaluation, search engine, Lucene