Font Size: a A A

On Optimizing List Intersection And Early Termination In Search Engines

Posted on:2013-02-21Degree:DoctorType:Dissertation
Country:ChinaCandidate:F ZhangFull Text:PDF
GTID:1228330395489924Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
A huge number of user queries are processed by search engines every day. Thequality of results and the efficiency of query processing are the two most importantcriteria to evaluate a search engine’s performance. In recent years, both academic andindustry community made greate effort to improve the efficiency of query processing.We focus on optimizing the efficiency of query processing of the search engines.List intersection is an important operation in the processing of conjunctive booleanqueries. We systematically review the state-of-the-art list intersection algorithms andevaluate them with different kinds of intersection and search strategies using real worldquery sets on GOV and GOV2data sets. Previous works on list intersection usuallyuse the number of comparisons as the measure of the algorithm’s complexity.However,reducing the number of comparisons does not necessarily lead to performance improve-ment. We also investigate the intersection algorithms in terms of branch prediction andcache misses which may also affect the performance of the algorithms, and find thathash segmentation approach with “in-segment” linear search benefit from less cacheand branch prediction misses, it has the best CPU running time. We study the impactof document reordering on the performance of list intersection algorithms. Most of thealgorithms achieve better performance on the URL-sorted index than a random one.We find that the intersection algorithms have less cache and branch prediction misseson the URL-sorted index. We also analyze the performance according to the relativelength ratio of the lists, and design a heuristic scheduling policy which results in sig-nificant speedup over the best single intersection algorithm. Last, we investigate theperformance of intersection of compressed lists.The Graphics Processing Units (GPU) has massive parallel computation power.In this paper, we study how to improve the efficiency of the query processing in thesearch engine using GPU. In particular, we focus on optimizing system throughputwhen the system is under heavy query traffic or processing non-interactive queries. Wepropose a CPU-GPU cooperative model to accelerate the list intersection operation in the search engine. Experimental results show our methods can significantly improvethe throughput of query processing. When system is in high query traffic or processingnon-interactive queries, the synchronous mode is used. In this mode, several queriesare grouped at CPU end and sent to GPU in batches to be processed. Since GPU hasmassive parallel processing power, the throughput of query processing can be great-ly increased. We extensively study the search algorithms used in the GPU intersec-tion algorithm. In particular, when two inverted lists1and2are to be intersected(|1|≤|2|), for each element of1, the GPU thread performs some search algorithm tofind whether the element appears in2. The simplest search algorithm is binary search,we choose it as the baseline algorithm, and we also propose the linear regression searchalgorithm, hash segmentation algorithm and bloom filter based algorithm. The exper-imental results show that the hash segmentation has the best performance in terms ofGPU running time. In this paper we also study the impact of document reorderingon the performance of GPU intersection algorithms. Last, we study the performanceof the document evaluation and ranking steps in the query processing, and propose abatch ranking algorithm which is suitable for GPU processing.One important technique to improve retrieval efficiency is early termination, whichspeeds up query processing by avoiding scanning the entire inverted lists. Most earlytermination techniques first build new inverted indexes by sorting the inverted lists inthe order of either the term-dependent information, e.g., term frequencies or term IRscores, or the term-independent information, e.g., static rank of the document and thenapply appropriate retrieval strategies on the resulting indexes. Although the methodsbased only on the static rank have been shown to be ineffective for the early termina-tion, there are still many advantages of using the methods based on term-independentinformation. We propose new techniques to organize inverted indexes based on the ter-m independent information beyond static rank and study the new retrieval strategies onthe resulting indexes. We perform a detailed experimental evaluation on our new tech-niques and compare them with the existing approaches. Our results on the GOV andGOV2data sets show that our techniques can improve query efficiency significantly.
Keywords/Search Tags:search engine, list intersection, graphics processing unit, earlytermination
PDF Full Text Request
Related items