Font Size: a A A

Study On Top-k Qury Qrocessing Optimization

Posted on:2017-02-21Degree:MasterType:Thesis
Country:ChinaCandidate:Y JiangFull Text:PDF
GTID:2428330569498551Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
As search engines enable people to pinpoint required information quickly from massive web pages,its importance is increasing day to day.At present,large-scale search engines have petabytes of data and have to process millions of queries everyday,consuming bulks of time and hardware resources.Therefore,the research on the optimization of query processing has become the focus of both the industry and the academia world.Top-k query is one of the widely used techniques of search engines.The algorithm returns the first k number of the most relevant results from masses of data,avoiding the scoring of irrelevant documents during query processing.Although top-k query has largely improved system performance,there are still some problems lying in index structure,document filtering strategies and document estimating methods.Accordingly,this paper focuses on the optimization of top-k query processing technology.The main research contents are listed as follows:(1)Based on the self-index structure of inverted index,we analyze and designe the hierarchical self-index structure.This structure,which is built on groups of elements with fixed length,uses iterative approach to extract skipping pointers and constructs the upper level self-index.It implements random access to the inverted index and can effective ly support the two state-of-the-art top-k query processing technologies,being MaxScore and WAND.Experiment results show that the structure significantly reduces the number of data chunks to be decompressed and obviously improves performance.Furthermore,an extendible retrieval system with hierarchical self-index is implemented in the experime nt,which offers a foundation for the following research and validation of the optimiza t io n on top-k query processing.(2)In order to solve the problem of slow initialization in top-k query processing,we studied the principle of MaxScore and WAND,and proposed rapid start algorithms of top-k query based on thresholds.The algorithms extract static top-k information of inverted index and calculate initial threshold in real time for specific query,avoiding redundant processing of large numbers of documents with low relevance.Experime nt results show that the algorithms could effectively estimate initial threshold effective ly and thus dramatically reduce the number of documents entering the result heap,while still maintaining the safety of results.(3)Regarding the estimation of score upper bounds during the top-k query processing,we propose another optimized algorithm based on linear programming.It takes the maximum score of each query subset as the objective function,the restrictio n conditions between query terms as constraints,and then abstracts the problem into a linear programming model.Results from experiments prove that this algorithm ensures safety of the query processing,and lowers score upper bounds of the candidate documents effectively.
Keywords/Search Tags:search engine, top-k query processing, inverted index, MaxScore, WAND
PDF Full Text Request
Related items