
Research On An Efficient Top-k Query Algorithm Based On MapReduce

Posted on: 2018-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: H K Li
Full Text: PDF
GTID: 2428330518958885
Subject: Computer software and theory
Abstract/Summary:
Extracting useful information from big data efficiently and accurately is an important problem in big data research. Top-k querying is a common preference-sensitive query technique that is widely used in network system monitoring, meta-search engines, relational databases, distributed systems, and other settings. Commonly used Top-k query techniques currently have the following deficiencies:

1. Most existing Top-k query processing techniques target traditional centralized databases and are rarely designed for distributed environments.
2. The Top-k techniques now used in distributed environments are mostly adaptations of centralized database algorithms, and they suffer from shortcomings such as high communication volume and low query efficiency.
3. To date, we have not found a mature Top-k query technique based on the Map-Reduce programming model.

To address these deficiencies, we start from the design dimensions of traditional Top-k query techniques, study the processing modes and performance of classic Top-k query algorithms, and, taking the characteristics of distributed environments into account, propose the NTA algorithm, a new method based on the Map-Reduce parallel programming model that makes up for the shortcomings of Top-k querying in distributed environments.

The contents of this thesis are as follows:

1. We analyze Top-k query techniques along multiple dimensions, including query model, data access, query and data uncertainty, and ranking function. We then discuss classic Top-k query algorithms such as TA and TPUT. Finally, we analyze the performance of these two algorithms and point out their shortcomings.
2. Building on the merits of the classical algorithms, we present the NTA algorithm, which uses a new pruning strategy to improve Top-k querying in distributed environments. First, the data set is pruned using the new threshold and the upper-bound rule. Then the pruned data sets are aggregated and the most suitable results are returned. The advantages of this method are as follows: (1) The new pruning strategy eliminates more impossible results and improves efficiency during the intermediate stages of the algorithm. (2) The Map-Reduce parallel programming model ensures efficient parallel execution of the algorithm.
3. Finally, the algorithm is designed and implemented on the open-source Hadoop platform. According to the experimental results, our algorithm has good efficiency and scalability.
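The prune-then-aggregate pattern described in the abstract can be illustrated with a minimal Python sketch. This is not the NTA algorithm, whose threshold and upper-bound rules are not given here. It shows only the general idea, under the simplifying assumption that each object's complete score lives in exactly one partition, so keeping the local top k on each partition loses no global result. The names `Record`, `map_local_top_k`, `reduce_global_top_k`, and `top_k` are hypothetical, and plain functions stand in for Hadoop mappers and reducers.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical record shape: (object id, score). Assumption: every object
# appears in exactly one partition with its full score.
Record = Tuple[str, float]


def map_local_top_k(partition: Iterable[Record], k: int) -> List[Record]:
    """Map side: prune one partition down to its k highest-scoring records."""
    return heapq.nlargest(k, partition, key=lambda rec: rec[1])


def reduce_global_top_k(candidates: Iterable[List[Record]], k: int) -> List[Record]:
    """Reduce side: merge the pruned candidate lists and keep the global top k."""
    merged = (rec for local in candidates for rec in local)
    return heapq.nlargest(k, merged, key=lambda rec: rec[1])


def top_k(partitions: Iterable[Iterable[Record]], k: int) -> List[Record]:
    """Run the map step on every partition, then a single reduce step."""
    return reduce_global_top_k((map_local_top_k(p, k) for p in partitions), k)


if __name__ == "__main__":
    partitions = [
        [("a", 5.0), ("b", 1.0), ("c", 9.0)],
        [("d", 7.0), ("e", 2.0)],
        [("f", 8.0)],
    ]
    print(top_k(partitions, 2))
```

Each mapper sends at most k records to the reducer, which is where the communication saving comes from. When an object's score is spread across several nodes and must be summed, local top-k pruning alone is no longer exact, which is the case that threshold-based algorithms such as TA and TPUT are designed to handle.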
Keywords/Search Tags: Top-k query, Big Data, Distributed Environment, MapReduce