Research On An Efficient Top-k Query Algorithm Based On MapReduce

Posted on:2018-11-21

Degree:Master

Type:Thesis

Country:China

Candidate:H K Li

Full Text:PDF

GTID:2428330518958885

Subject:Computer software and theory

Abstract/Summary:

Extracting useful information from big data efficiently and accurately is an important problem in big data research. The Top-k query is a common preference-sensitive query technique that is widely used in network system monitoring, meta-search engines, relational databases, distributed systems and other settings. Commonly used Top-k query techniques currently have the following deficiencies: (1) Most existing Top-k query processing techniques target traditional centralized databases and are rarely designed for distributed environments. (2) The Top-k techniques now frequently used in distributed environments are adaptations of centralized database algorithms and suffer from shortcomings such as high communication cost and low query efficiency. (3) To date, we have found no mature Top-k query technique built on the MapReduce programming model. To address these deficiencies, we start from the design dimensions of traditional Top-k query techniques, study the processing modes and performance of classic Top-k algorithms and, taking the characteristics of distributed environments into account, propose the NTA algorithm, a new method based on the MapReduce parallel programming model, to remedy the shortcomings of Top-k queries in distributed environments.

The contents of this thesis are as follows:

1. We analyze Top-k query techniques along multiple dimensions, including the query model, data access, query and data uncertainty, and the ranking function. We then discuss classic Top-k query algorithms such as TA and TPUT. Finally, we analyze the performance of these two algorithms and point out their shortcomings.

2. Building on the merits of the classical algorithms, we present the NTA algorithm, which uses a new pruning strategy to improve Top-k queries in distributed environments. First, the data set is pruned using a new threshold and an upper-bound rule. Then the pruned data sets are aggregated to return the most suitable results. The advantages of this method are as follows: (1) The new pruning strategy eliminates more candidates that cannot appear in the result, which improves efficiency during the intermediate stages of the algorithm. (2) The MapReduce parallel programming model ensures efficient parallel execution of the algorithm.

3. Finally, the algorithm is designed and implemented on the open-source Hadoop platform. The experimental results show that the algorithm has good efficiency and scalability.
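
For readers unfamiliar with the threshold-based pruning that the abstract refers to, the following is a minimal single-machine sketch of the classic TA (Threshold Algorithm), one of the baseline algorithms the thesis discusses. It is not the thesis's NTA algorithm and does not use MapReduce. The function name `threshold_algorithm`, the use of summation as the ranking function, the assumption that scores are non-negative, and the treatment of an object missing from a list as having score 0 are all illustrative choices made here, not details taken from the thesis.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_algorithm(lists: List[Dict[str, float]], k: int) -> List[Tuple[str, float]]:
    """Classic TA sketch with sum as the monotone ranking function.

    `lists` holds one {object_id: score} dict per attribute (hypothetical
    layout). Assumptions: scores are non-negative, and an object missing
    from a list contributes 0.0 for that attribute.
    """
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(d.items(), key=lambda kv: (-kv[1], kv[0])) for d in lists]
    totals: Dict[str, float] = {}  # full aggregate score of every object seen so far
    depth = max((len(s) for s in sorted_lists), default=0)

    for row in range(depth):
        last_seen = []
        for s in sorted_lists:
            if row < len(s):
                obj, score = s[row]
                last_seen.append(score)
                if obj not in totals:
                    # Random access: fetch this object's score from every list.
                    totals[obj] = sum(d.get(obj, 0.0) for d in lists)
            else:
                # List exhausted: any unseen object scores 0 here.
                last_seen.append(0.0)

        # No unseen object can score higher than the sum of the last seen scores.
        threshold = sum(last_seen)
        top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # early stop: the remaining objects are pruned

    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


example = [
    {"a": 9.0, "b": 8.0, "c": 1.0},
    {"a": 7.0, "b": 9.0, "c": 2.0},
]
print(threshold_algorithm(example, 1))  # [('b', 17.0)]
```

In this example the scan stops after two rounds of sorted access, so object `c` is never examined. A distributed variant has to decide where the per-attribute lists reside and how many communication rounds are needed to exchange thresholds, which is the cost the abstract identifies as a weakness of existing approaches.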

Keywords/Search Tags:

Top-k query, BigData, Distributed Environment, MapReduce

Related items

1	Working Principle And Applied Research Of MapReduce
2	Efficient K-dominant Skyline Query Based On Dominate Hierarchical Tree In MapReduce Environment
3	MDE-Based Approach For Mapreduce Bigdata Transformation Software Development
4	Top-k Skyline Query Algorithm Based On Data Partition In Distributed Environment
5	Research On Reverse Skyline Query Algorithm Based On SR Tree Under MapReduce Model
6	Research On Distributed Reasoning And Query Method Based On Domain Ontology
7	Mapreduce Job Scheduling For Heterogeneous Geo-distributed Clusters
8	Research And Implementation Of Data Placement And Query Techniques Based On MapReduce In Distributed Multi-Dimensional Data Warehouse
9	The Research Of Regular Path Query On Large-scale RDF Graph
10	Research And Improvement Of Skyline Query Algorithm In MapReduce Framework