Font Size: a A A

Efficient TopK Processing In Web Search Systems

Posted on:2010-06-18Degree:DoctorType:Dissertation
Country:ChinaCandidate:M J ZhuFull Text:PDF
GTID:1118360302463029Subject:Signal and Information Processing
Abstract/Summary:PDF Full Text Request
The World Wide Web grows so rapidly that it has become the largest and the most popular source of readily available information in the world. To fully utilize the information on the Web, web search engines are vitally required. As a result, research on various web search techniques becomes increasingly important in the area of information retrieval. Due to the huge amount of data and massive search requests, the search efficiency is a key problem for the application of web search engines. As a solution to improve the search efficiency, TopK processing aims to get top K correct results from the inverted index with as small space and time cost as possible. An efficient TopK processing approach is crucial to the web search efficiency. In the thesis, we study the TopK processing problem in web search systems, including both general web search and object-based web search, and propose several efficient and effective TopK processing algorithms. To be specific, we make the following contributions:·For general web search,we study the TopK problem with some especially important factors which are missed in existing research, like term proximity and web page structure. Our investigation demonstrates that, when term proximity is incorporated into ranking functions, most existing index structures and top-k strategies become quite inefficient. To solve this problem, we built the inverted index based on web page structure and proposed the query processing strategies accordingly. The experimental results indicate that the proposed index structures and query processing strategies significantly improve the top-k efficiency. Moreover, we propose a Proximity-Probe Heuristic to make our top-k algorithms more efficient. We also test the efficiency of our approaches on various settings (linear or non-linear ranking functions, exact or approximate top-k processing, etc). Besides, the efficiency of index building and maintenance would not be affected too much with our proposed approaches.·Phrase index has been successfully adopted to process phrase queries in existing work. Due to the Phrase-friendly property of term proximity, we propose to build a compact phrase index to speed up the search process when incorporating the term-proximity factor. The compact phrase index can help more accurately estimate the score upper bounds of unknown documents. The size of the phrase index is controlled by including a small portion of phrases which are possibly helpful for improving search performance. Phrase index has been used to process phrase queries in existing work. It is, however, to the best of our knowledge, the first time that phrase index is used to improve the performance of generic queries. Experimental results show that, compared with the state-of-the-art TopK computation approaches, our approach can reduce average query processing time to 1/5. We also try the experiments to use the compact phrase index for web page structure-based inverted index for better pruning performance.·We study the aggregation-aware TopK processing: efficiently retrieving TopK items of one type based on the inverted index of another type of items. It would be very inefficient by directly utilizing traditional TopK approaches. We follow TA (the Threshold Algorithm) in this scenario. We present an aggregation-aware top-k computation framework with three pruning principles upon the conventional inverted index and a novel inverted index type HybridRank, which employs the item information of both types. Experimental results show that our proposed new index structure and the aggregation-aware TopK strategy provide an efficient solution for this aggregation-aware TopK problem.In summary, we study the efficiency TopK processing problem for web search. Various experiments on real web search data sets indicate that our methods are efficient and applicable.
Keywords/Search Tags:Web Search Engine, Information Retrieval, Performance optimization, TopK, Index Pruning
PDF Full Text Request
Related items