Font Size: a A A

On Improving Context-Free-Grammar-Based Index Compression Algorithm

Posted on:2020-10-10Degree:DoctorType:Dissertation
Country:ChinaCandidate:Z H ZhangFull Text:PDF
GTID:1488306461465484Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Among numerous Internet and mobile applications,search engine is a fundamental application for people to retrieve information,and always faces serious performance challenges.Storage and calculation are two critical aspects for performance optimization,which are always interacted on each other.Since inverted index is the kernel data structure of search engine,index compression not only involves massive data storage,but also has a direct impact on retrieval efficiency.As a result,a large amount of research has been driven with the goal of minimizing the space consumption of index while increasing query processing speed.At present,the most typical compression scheme of inverted index is integer sequence encoding,which aims to reduce the storage bit widths of elements by well-designed compact formats.There is a natural similarity among documents written in a human language.According to this similarity,there were some proposed index compression schemes aimed to reduce redundancy among inverted lists.These schemes reduced the space consumption of inverted index by replacing duplicated data with more compact representations.However,most of them kept focus on highly repetitive document collections,and few of them were evaluated on regular web-page document collections.Moreover,most related works only paid more attention to achieve better compression but had not considered how to improve query processing performance on the compressed index.In addition,many index structures based on a Bitmap format were purposed.Especially in recent years,a well-designed retrieval model has been implemented in commercial search engine platform,of which kernel data structure is entirely based on Bitmap format,namely signature file,with high efficiency on query processing.However,this new index scheme has a serious problem on space occupancy,while demands compression scheme with special design requirements.To end this,this thesis introduces a compression scheme based on context-free grammar,which not only includes compression algorithms for inverted index and signature file,but also illustrates retrieval algorithms on such grammar-based compressed indexes,respectively.This thesis aims to propose a unified solution for the critical problems of storage and calculation in the performance optimization of search engine platform.The specific work of this thesis includes the following aspects:Firstly,an inverted index compression scheme based on context-free grammar is proposed.By analyzing the realistic datasets,we find that redundancy is ubiquitous in inverted indexes of web-page document collections.As a result,a redundancy identification algorithm is proposed,which aims to find repetitive document identifier sequences and reduce the space overhead of these redundancies by reduction replacements.Replacing redundant sequence with a pattern identifier not only reduces space occupancy,but also provides a potential acceleration for query processing.At this end,a query processing algorithm which aims to reduce the number of comparisons during document retrieval is proposed.In order to improve the performance of compression and retrieval,this thesis further proposes a grammar compression scheme combined with integer sequence encoding,and introduces a complete hierarchical index structure based on such compression scheme.As a result,encoded grammar compression shows a reduction up to 17% and 12% over Opt PFD in the average number of bits required for each document identifier,namely bpi,on Gov2 and Clue Web09,respectively.While the reduction over Eilas is up to 8.7% and 6.4%.Besides,the improvements of intersection time of encoded grammar compression over Opt PFD is up to 20% and 8.4%.However,it underperforms Elias on intersection time up to 15% and 9.9%.Secondly,several strategies for optimizing the performance of grammar-based index are proposed.To increase similarities among doc ID sequences,we illustrate a top-down doc ID reassignment method.Then,regarding the specific requirements for storage overhead and retrieval efficiency,two dictionary reduction methods are proposed.According to the actual contributions to compression and query processing,patterns in dictionary are pruned to improve the overall space-time performance.Finally,an alternative grammar compression scheme which aims to accelerate query processing speed is proposed.We analyze the accelerations obtained by grammar representation,then purpose an index scheme which combines context-free grammar compression with dynamic partition of inverted lists.Based on the train query distribution,a selective grammar-based compression scheme is purposed to improve the retrieval efficiency of grammar-based index.For Gov2 and Clue Web09,the improvement of bpi of selective grammar-based compression over Packed Binary is up to 11% and 9.4%,while the improvement of intersection time is up to 8.1% and 14%,respectively.Finally,a lossy compression scheme for signature file is proposed.By verifying the performances of various encoding methods on realistic data,the critical problems for compressing signature file are expounded.To end these critical problems,we analyze and verify the rationalities and limitations of grammar-based compression method.Due to the defects of lossless grammar compression scheme,a lossy grammar-based compression method and a sequence intersection algorithm is further proposed.Then we pay more attentions to analyze the problem of compression loss.Moreover,a selective grammar compression scheme and a hybrid index based on grammar compression is purposed to reduce false positive errors caused by lossy compression.The matrix size of signature file built from Gov2 corpus with selective grammar compression is reduced by 19% under the condition that additional false positive rate caused by lossy compression is up to 6.7%.The size of hybrid index is reduced by 62% compared to the reduced signature file,while the hybrid index increases intersection time up to 14%.The context-free grammar provides a unified solution for the storage and calculation problems of the kernel data structure in search engine,thus indicates different directions for improving the space-time performance of index structures.Although this thesis keeps focus on the optimization of search engine,the proposed methods are also valuable in other related scenarios involving large-scale document retrieval.
Keywords/Search Tags:search engine, inverted index, signature file, contex-free grammar, compression
PDF Full Text Request
Related items