Font Size: a A A

Study On Index Pruning For Web Search Engine

Posted on:2014-02-10Degree:DoctorType:Dissertation
Country:ChinaCandidate:D D DanFull Text:PDF
GTID:1228330392462194Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
Currently, more and more people use search engines as the main entrance toobtain information on the Web. But the growing number of web pages and queriesmake the performance of search engine a big challenge. Usually, a moden search en-gine needs to process thousands of queries on tens of billions of web pages per sec-ond. Therefore, how to efficiently process queries has been an important researchproblem in the field of search engine and information retrieval.In this paper, we study the methods of index pruning to enhance the efficiencyof query processing. Usually, the index pruning methods are divided into static indexpruning and dynamic index pruning. Static index pruning methods are used in indexconstruction phase. They remove the less important information in document toshorten the length of inverted list and reduce the size of the inverted index. As theinverted list is shorted, the speed of query processing can be enhanced. Dynamic in-dex pruning method are used in the query processing phase. They usually use somespecial auxiliary structure to skip less important documents. So the query processingspeed can be enhanced a lot. In this paper, we study efficiently query processing bythe static index pruning and dynamic index pruning methods. We propose some newindex structures and algorithms to support efficient query processing.The main work, primary innovations and contribution are as follows:1. In this paper, we propose a web page structure and importance based staticindex pruning method. On the one hand, this method use the web pagestructure to distinguish the importance of terms in different structures. Itpreferentially retained terms in the important structure. On the other hand,it treats the importance of web pages differently and reserved more termsfor important web pages. Experiments on GOV2datasets show that the webpage structure and web page importance can help to remove less importantterms and enhance the query effectiveness. The effectiveness of the pruningmethod which combine these two information as good as using full index when removing80%of the terms. At this time, the storage overhead is re-duced by73%and the speed of query processing enhanced2times.2. In this paper, we propose a framework for dynamic index pruning methodson the Block-Max index. We also propose three new dynamic index pruningalgorithms, respectively: Block-Max MaxScore (BMM), Local Block-MaxWAND (LBMW) and Local Block-Max MaxScore (LBMM). In this framework,the query processing can be divided into three steps:1). Choose the candi-date documents. In this paper, we uses the local block maximum scores in-stead of terms global maximum scores to select the candidate documentswhich can reduce the number of candidate document a lot;2). Filter candi-date documents by using block’s maximum score. We presents a more effi-cient filtering method which can take full advantage of the locality of queryprocessing, and to enhance the efficiency of the filter;3). Scoring candidatedocuments. The proposed algorithms use block-max score to estimate theupper bound score of documents dynamically. It can stop scoring when theupper bound score below current threshold. Experiments show that theproposed algorithms can process queries more efficiently and they are onetimes faster than classical pruning methods.3. In this paper, we propose two kinds of methods to store block maximumstatic score and maximum relevant score. Using those scores, the dynamicindex pruning algorithms with static score rank function can process queriesmore efficiently. One approach is that it stores the maximum relevant scoreand static score for each block respectively; Another method is that it com-bines the relevant score and static score for each document and store max-imum combined score for each block additionally. Obviously, the secondmethod can estimate the upper bound scores of the candidate documentmore accurately. But using combined method, document’s upper boundscore will be underestimate in some case. We analyze those cases and pro-pose some solution to solve it. At the same time, we also discusses the dy-namic index pruning algorithm using document reordering technology. Ex-periments show that when the static scores proportion in ranking function issmaller, using URL ordering index can process queries faster. As the propor-tion increases, it is more suitable to use static score ordering index.4. In this paper, we propose a optimization method to divide intervals for each indexed term in the Space-Max index. We also propose four new dynamicindex pruning algorithms on the Space-Max index, respectively: Space-MaxWAND (SMW), Space-Max MaxScore (SMM), Space-Max AND (SMA) andSpace Max AND with Check (SMAC). The proposed interval divided methoduses two kinds of information: the estimation error of document scores andinterval processing overhead. Under the same conditions, the optimizedmethod can divied an optimal interval for each inverted index. Using thoseinterval’s maximum scores, the new proposed pruning algorithms can bemore accurate to estimate the upper bound score of a document. Thus, theycan find higher quality of candidate documents, and the candidate docu-ments can be stopped scoring earlier. The experiments show that pruningmethods using optimized interval divided are much faster than that usinginterval divided according to the length of inverted list. Meanwhile, theSpace-Max index based pruning methods are more efficient than block-maxindex based methods.5. On the basis of the above research, we redesign and re-implement the“TianWang” search engine The proposed methods can make a search en-gine which can index more documents and processing queries more effi-ciency.
Keywords/Search Tags:Inverted Index, static index pruning, dynamic index pruning, top-k queryporcessing
PDF Full Text Request
Related items