Font Size: a A A

Index Compression and Efficient Query Processing in Large Web Search Engines

Posted on:2014-02-07Degree:Ph.DType:Thesis
University:Polytechnic Institute of New York UniversityCandidate:Ding, ShuaiFull Text:PDF
GTID:2458390005992089Subject:Computer Science
The inverted index is the main data structure used by all the major search engines. Search engines build an inverted index on their collection to speed up query processing. As the size of the web grows, the length of the inverted list structures, which can easily grow to hundreds of MBs or even GBs for common terms (roughly linear in the size of the data set), becomes a bottleneck for query processing. Given that search engines need to answer user queries within fractions of a second, naively storing the uncompressed index and simply traversing the posting lists becomes too slow and is not acceptable. There are many optimization techniques focused on speeding up query processing in search engines. Inverted index compression is one very important technique. Our goal of inverted index compression is to reduce the size of the inverted index while at the same time keeping decompression as fast as possible. It is known that a smaller index usually results in faster query processing (better caching performance, less disk IO); top-k query processing (also called early termination) is another important class of techniques which directly aims at decreasing query processing times. All the major search engines apply some form of early termination techniques. Last but not least, all the search engines use a large cluster of machines, which makes the way to distribute indices among machines critical to the search performance.;In this thesis, we focus on these three classes of techniques for efficient query processing, namely, inverted index compression, top-k query processing and indexing strategies in a distributed search system:;In the first part of this thesis, we study how to achieve better inverted index compression. In particular, we study scalable methods to find a document ID assignment to improve the inverted index compression. As mentioned earlier, inverted index compression is important for search engines. Instead of looking for a new compression technique, we study the document ID (docID) assignment problem, which is also critical for good index compression. The docID assignment problem could be described as follows: Instead of assigning documents with random IDs, the goal is to find a better way to assign IDs to documents such that index compression is improved. Why would better docID assignment result in better index compression? It is because by assigning similar docIDs to similar documents, we could make the inverted index more clustered (To be more precise, we actually make each posting list more clustered). If we make the indexes more clustered, we get smaller gaps in the inverted lists and thus better compression (explained in more detail later). This is already shown in preview work [49, 60]: A better document ID assignment could result in a much better compression ratio and more importantly, much faster query processing. However, in previous work, most of the proposed algorithms only scale to a relatively small dataset (usually less than 1 million documents). One exception is the URL-sorting algorithm which is simple but very effective. In this thesis, we propose a scalable algorithm which achieves a better compression ratio than URL-sorting, based on modeling the problem as a Traveling Salesman Problem. Another advantage of our algorithm is that it can be applied to all datasets.;In the second part of this thesis, we study new techniques for early termination. The goal for early termination techniques is to return search results that are as good as possible, and at the same time only process parts of the posting lists. In this thesis, we assume that a standard IR ranking function is used, where the score of a query for a document is equal to the sum of the scores of single terms within the query. In this thesis, we focus on algorithms that return exactly the same results as the naive algorithm. In a nutshell, our algorithm augments the current index structure by adding the max score for each block in the posting lists, and then using this block max score information to skip most parts in the posting lists where the maximum possible score could not make it into the final top-k result set. We experimentally show that our algorithm achieves much better query processing than the state-of-the-art algorithms, with only a slight increase of the inverted index size. Moreover, we show that our algorithm works for both disjunctive and conjunctive queries.;Both of the techniques-docID reassignment and top-k query processing, are important for decreasing the latency of query processing. In the last part of this thesis, we propose some indexing strategies that aim to balance the availability, load distribution, and fault tolerance in a distributed search system. (Abstract shortened by UMI.).
Keywords/Search Tags:Search, Index, Query processing, Document ID, Early termination, Posting lists
Related items