Font Size: a A A

Efficient Algorithms for Search Engine Query Processing

Posted on:2017-01-24Degree:Ph.DType:Thesis
University:Polytechnic Institute of New York UniversityCandidate:Dimopoulos, KonstantinosFull Text:PDF
GTID:2468390011995432Subject:Computer Science
Abstract/Summary:
Large search engines receive billions of queries per day that need to be processed in fractions of a second over trillions of documents. To process this work-load, such engines use hundreds of thousands of machines distributed over multiple data centers, resulting in very significant hardware and energy resources. Query processing is responsible for a significant part of this cost, and a lot of research has focused on techniques that decrease this cost by optimizing the inverted index, the main index structure used by current search engines. Some important methods for this task include index compression and early-termination algorithms that, using various shortcuts, try to identify the best or most promising results without an exhaustive evaluation of all candidates. In this thesis, we focus on improving the efficiency of top-k query processing algorithms using early termination and index compression techniques.;In the first part, we study early-termination strategies to accelerate disjunctive top-k query processing algorithms using the Block-Max Index, which is an augmented inverted index structure that stores block-wise score approximations. We propose a new layout for this auxiliary structure that provides extremely fast block maxscore lookups using bit shifts and very aggressive early termination. The experimental evaluation of the proposed Block-Max Index on new and existing query processing algorithms shows important speed-ups over the previous fastest results.;In the second part, we describe how to design a general mechanism that can improve the performance of a whole class of disjunctive top-k query processing algorithms. Our filtering mechanism is based on a combination of Block-Max Indexes and bitmaps, exploits the SIMD capabilities of current microprocessors, and is optimized through caching policies that select and store suitable filter structures based on properties of the query load. We experimentally show that the mechanism results in very significant performance improvements for several well-known disjunctive top-k query processing algorithms.;Finally, we study online document routing and index reorganization strategies that optimize the performance of top-k queries in distributed search engines. More specifically, we focus on a framework where a document dispatcher receives streams of newly crawled documents that must be routed to a single index partition. Previous studies showed that document routing in this setup is crucial for index compression. We propose space/time-efficient routing schemes based on the URL of the documents that outperform previous methods in terms of aggregated index size. We also generalize the previous framework by allowing index reorganization through merging and docID reassignment policies, and experimentally show how the right combination of routing methods and index reorganization strategies results in more compact indexes and significantly improves the query processing performance of several conjunctive and disjunctive top-k retrieval algorithms.
Keywords/Search Tags:Query processing, Algorithms, Index, Search, Performance, Results
Related items