
Index Compression And Query Processing In Search Engines

Posted on: 2016-08-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: K Jiang
Full Text: PDF
GTID: 1318330536467165
Subject: Computer Science and Technology
Abstract/Summary:
With the growing popularity of the Internet and the rapid development of Internet applications, Internet resources have become the largest, richest, and most widely used source of information and knowledge. To retrieve information effectively from these massive data, Web search engines have become the best technical means of providing users with fast resource targeting. However, search engines must cope with a rapidly growing volume of information, high query loads, and tight performance constraints. How to process queries efficiently has therefore become an important research question in the field of search engines. Any improvement in query processing efficiency reduces operational costs and improves user satisfaction, and hence the overall benefit. In this thesis, we focus on query processing efficiency, address several problems in inverted index compression, index traversal, and dynamic pruning, and propose several novel techniques:

(1) Index compression is partially responsible for the current performance of Internet search engines. We study the state-of-the-art index compression technique Simple9, in which the number of wasted bits remains large. We focus on the bits wasted when the number of integers is insufficient to fill a complete mode, padding extra zeros to form a complete dense mode. More precisely, we first propose a novel index compression method called Dense Simple9, whose dense padding modes achieve more compact storage than Simple9. We then design a metric for removing the inserted extra zero integers during decoding. Meanwhile, to speed up compression and decompression, we propose a grouped Simple9 technique, named Group Simple9, which reduces compression and decompression latency by shortening the reading, branching, shifting, and table-lookup operations. Experimental results on the TREC WT2G and GOV2 data sets show that the proposed Group and Dense Simple9 compression algorithms significantly improve the compression ratio as well as the compression and decompression speed.

(2) To improve query speed, we study the state-of-the-art dynamic pruning algorithms, focusing on the most commonly used one, Max Score. The current Max Score algorithm chooses candidate documents from the essential lists, while postings from the non-essential lists can only contribute partial scores. Further, Max Score accesses the essential lists sequentially but uses only random access for the non-essential lists, so it fully scores a large number of unpromising candidate documents. To solve this problem, we implement a hierarchical self-index structure that supports random access into inverted lists, enabling skipping not only in non-essential lists, as the original method does, but also in essential lists. We call this Essential List Skipping Max Score (ELS-Max Score); it provides more promising candidates for scoring. Experiments on the TREC GOV2 data set show that ELS-Max Score processes significantly fewer elements, reducing the average query latency by almost 18% over the Max Score baseline and 84% over the OR_DAAT baseline, while still returning the same results as the disjunctive evaluation.

(3) We propose another novel exhaustive index traversal scheme called Largest Scores First (LSF) retrieval, in which candidates are first selected from the posting list of the important query term with the largest upper-bound score and then fully scored with the contributions of the remaining query terms. The scheme effectively reduces the memory consumption of existing term-at-a-time (TAAT) retrieval and the candidate selection cost of existing document-at-a-time (DAAT) retrieval, at the expense of revisiting the posting lists of the remaining query terms. Preliminary analysis and implementation show performance comparable to the two well-known baselines. LSF retrieval makes full use of term importance to provide more promising candidates for scoring. To further reduce the number of postings that must be revisited, we present efficient rank-safe dynamic pruning techniques based on LSF, including two important optimizations called term omitting (LSF_TO) and partial scoring (LSF_PS) that fully exploit query term importance. Finally, experimental results on the TREC GOV2 collection show that our new index traversal approaches reduce query latency by almost 27% over the WAND baseline and are slightly better than the Max Score baseline, while still returning the same results as the exhaustive evaluation.
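To make the baseline concrete, the following is a minimal sketch of the standard Simple9 scheme that Dense Simple9 builds on: each 32-bit word spends 4 bits on a selector and packs the remaining 28 bits into equal-width slots. The mode table and greedy packing are the textbook Simple9; the dense padding modes and the zero-extraction metric of Dense Simple9 are not reproduced here.

```python
# Simple9 word-aligned compression (sketch): a 32-bit word holds a 4-bit
# selector plus 28 data bits split into equal-width slots.
# (number_of_slots, bits_per_slot) for the nine selectors:
MODES = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def encode(ints):
    """Greedily pack non-negative integers (< 2^28) into 32-bit words."""
    words, i = [], 0
    while i < len(ints):
        for sel, (n, bits) in enumerate(MODES):
            chunk = ints[i:i + n]
            # a mode fits if exactly n integers remain for it
            # and every one of them fits in `bits` bits
            if len(chunk) == n and all(v < (1 << bits) for v in chunk):
                word = sel  # selector in the low 4 bits
                for j, v in enumerate(chunk):
                    word |= v << (4 + j * bits)
                words.append(word)
                i += n
                break
        else:
            raise ValueError("integer does not fit in 28 bits")
    return words

def decode(words):
    """Unpack a list of Simple9 words back into the original integers."""
    out = []
    for word in words:
        n, bits = MODES[word & 0xF]
        payload = word >> 4
        mask = (1 << bits) - 1
        for j in range(n):
            out.append((payload >> (j * bits)) & mask)
    return out
```

Note where the waste the thesis targets comes from: a tail of, say, 4 small integers cannot use the (28, 1) mode and falls back to (4, 7), spending 7 bits per 1-bit value; Dense Simple9's padding modes address exactly this case.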
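For contribution (2), the following is a simplified sketch of the classical Max Score baseline (not ELS-Max Score): terms are ordered by their upper-bound scores, the non-essential prefix grows as the top-k threshold rises, essential lists are read sequentially, and non-essential lists are probed by random access with early termination. The in-memory posting-list representation and the partial-scoring loop are illustrative choices, not the thesis's implementation.

```python
import heapq

def maxscore_topk(postings, k):
    """postings: one sorted [(doc_id, score), ...] list per query term.
    Returns the top-k [(score, doc_id)] under disjunctive (OR) semantics."""
    # order terms by their maximum impact score (upper bound), ascending
    terms = sorted(postings, key=lambda pl: max(s for _, s in pl))
    ubs = [max(s for _, s in pl) for pl in terms]
    prefix = [0.0]                      # prefix[i] = sum of ubs[0..i-1]
    for ub in ubs:
        prefix.append(prefix[-1] + ub)

    lookup = [dict(pl) for pl in terms]         # random access (non-essential)
    ids = [[d for d, _ in pl] for pl in terms]  # sequential access (essential)
    cursors = [0] * len(terms)

    heap, threshold, first_ess = [], 0.0, 0     # min-heap of current top k
    while True:
        # grow the non-essential prefix while its combined upper bound
        # cannot beat the current threshold on its own
        while first_ess < len(terms) and prefix[first_ess + 1] <= threshold:
            first_ess += 1
        if first_ess >= len(terms):
            break
        # candidate = smallest unscored doc id across the essential lists
        cand = None
        for i in range(first_ess, len(terms)):
            if cursors[i] < len(ids[i]):
                d = ids[i][cursors[i]]
                cand = d if cand is None else min(cand, d)
        if cand is None:
            break
        # sequential scoring over essential lists
        score = 0.0
        for i in range(first_ess, len(terms)):
            if cursors[i] < len(ids[i]) and ids[i][cursors[i]] == cand:
                score += lookup[i][cand]
                cursors[i] += 1
        # random-access lookups in non-essential lists, highest bound first,
        # stopping as soon as the candidate can no longer enter the top k
        alive = True
        for i in range(first_ess - 1, -1, -1):
            if score + prefix[i + 1] <= threshold:
                alive = False
                break
            score += lookup[i].get(cand, 0.0)
        if alive and (len(heap) < k or score > threshold):
            heapq.heappush(heap, (score, cand))
            if len(heap) > k:
                heapq.heappop(heap)
            if len(heap) == k:
                threshold = heap[0][0]
    return sorted(heap, reverse=True)
```

The sketch makes the thesis's observation visible: candidates are generated only from the essential lists (sequential cursors), while non-essential lists contribute partial scores via `lookup[i].get`; ELS-Max Score additionally skips within the essential lists using a self-index, which this baseline cannot do.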
Keywords/Search Tags:Inverted Index, Index Compression, Simple9, Top-k Processing, Index Traversal, Dynamic Pruning