Study On Top-k Qury Qrocessing Optimization

Posted on:2017-02-21

Degree:Master

Type:Thesis

Country:China

Candidate:Y Jiang

Full Text:PDF

GTID:2428330569498551

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

As search engines enable people to pinpoint required information quickly from massive web pages,its importance is increasing day to day.At present,large-scale search engines have petabytes of data and have to process millions of queries everyday,consuming bulks of time and hardware resources.Therefore,the research on the optimization of query processing has become the focus of both the industry and the academia world.Top-k query is one of the widely used techniques of search engines.The algorithm returns the first k number of the most relevant results from masses of data,avoiding the scoring of irrelevant documents during query processing.Although top-k query has largely improved system performance,there are still some problems lying in index structure,document filtering strategies and document estimating methods.Accordingly,this paper focuses on the optimization of top-k query processing technology.The main research contents are listed as follows:(1)Based on the self-index structure of inverted index,we analyze and designe the hierarchical self-index structure.This structure,which is built on groups of elements with fixed length,uses iterative approach to extract skipping pointers and constructs the upper level self-index.It implements random access to the inverted index and can effective ly support the two state-of-the-art top-k query processing technologies,being MaxScore and WAND.Experiment results show that the structure significantly reduces the number of data chunks to be decompressed and obviously improves performance.Furthermore,an extendible retrieval system with hierarchical self-index is implemented in the experime nt,which offers a foundation for the following research and validation of the optimiza t io n on top-k query processing.(2)In order to solve the problem of slow initialization in top-k query processing,we studied the principle of MaxScore and WAND,and proposed rapid start algorithms of top-k query based on thresholds.The algorithms extract static top-k information of inverted index and calculate initial threshold in real time for specific query,avoiding redundant processing of large numbers of documents with low relevance.Experime nt results show that the algorithms could effectively estimate initial threshold effective ly and thus dramatically reduce the number of documents entering the result heap,while still maintaining the safety of results.(3)Regarding the estimation of score upper bounds during the top-k query processing,we propose another optimized algorithm based on linear programming.It takes the maximum score of each query subset as the objective function,the restrictio n conditions between query terms as constraints,and then abstracts the problem into a linear programming model.Results from experiments prove that this algorithm ensures safety of the query processing,and lowers score upper bounds of the candidate documents effectively.

Keywords/Search Tags:

search engine, top-k query processing, inverted index, MaxScore, WAND

PDF Full Text Request

Related items

1	Research On Query Processing And Result Caching In Search Engine
2	Caching And Query Pruning Strategy For Efficient Search Engine Query Execution
3	The Design And Implementation Of Vertical Search Engine For Position Query
4	Uyghur Khazak Kirgiz Multilingual Search Enigne Inverted Index Module And Its Implementation
5	Index Compression And Query Processing In Search Engines
6	The Research And Implement Of Index Technology In Search Engine
7	Based On Research And Optimization Lucene Inverted Index Performance
8	The Design And Implementation Of Site Search Engine Based On The Inverted Index And The Trie
9	Design And Implementation Of The LSA Data Processing Subsystem For Baidu Video
10	Efficient Algorithms for Search Engine Query Processing