Study On Performance Optimization Of MapReduce

Posted on:2017-02-07

Degree:Master

Type:Thesis

Country:China

Candidate:G Wang

Full Text:PDF

GTID:2308330482490598

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the Internet of Things, and the mobile Internet, massive amounts of data are produced every day. This explosive growth indicates that the era of big data has arrived. Data in this era is massive and complex, which makes storage and computation more difficult. MapReduce, designed by Google, simplifies distributed computing and has attracted a great deal of attention, so research on MapReduce and its optimization has practical significance.

MapReduce is a distributed computing model that simplifies data processing on large clusters, and it is widely used in the big data field. Hadoop is an open-source implementation of MapReduce capable of processing big data. However, some of its mechanisms hurt performance; for example, the default partitioner cannot guarantee load balance, which degrades performance. It is therefore necessary to optimize these mechanisms.

The main work of this thesis is as follows. First, we briefly introduce the Hadoop platform and then turn to the MapReduce computing model, analyzing its key components and internal operating mechanism. Based on an analysis of the source code, this thesis identifies two problems: load imbalance when data is skewed, and the low efficiency of the default speculative execution in heterogeneous environments. For skewed data, this thesis designs a new partitioning method that learns the distribution of the intermediate results through sampling and takes data locality into account, which balances the load. For the inefficiency of speculative execution, this thesis proposes an improved LATE algorithm. Building on LATE, it uses the historical information of nodes and data locality to identify stragglers more accurately and increase throughput. Finally, we build an experimental platform to verify the sample-based partitioning and the improved LATE algorithm. The experimental results show that the algorithms effectively improve the performance of MapReduce.
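
The abstract does not give the details of the sample-based partitioning, so the following is only a minimal sketch of the general idea, not the method of the thesis. It assumes that a sample of intermediate keys is available, counts how often each key occurs, and greedily assigns the heaviest keys first to the reducer with the smallest estimated load. The names `build_partition_table`, `partition`, and `reducer_loads` are hypothetical, and data locality, which the thesis also takes into account, is not modeled here.

```python
import heapq
from collections import Counter


def build_partition_table(sampled_keys, num_reducers):
    """Map each sampled key to a reducer, heaviest keys first (greedy).

    Hypothetical helper: an illustration of skew-aware partitioning,
    not the algorithm described in the thesis.
    """
    counts = Counter(sampled_keys)
    # Min-heap of (estimated load, reducer index).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    table = {}
    # Sort by descending frequency; break ties by key for determinism.
    for key, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)  # least-loaded reducer so far
        table[key] = reducer
        heapq.heappush(heap, (load + count, reducer))
    return table


def partition(key, table, num_reducers):
    """Use the table for sampled keys; hash keys that the sample missed."""
    if key in table:
        return table[key]
    return hash(key) % num_reducers


def reducer_loads(keys, table, num_reducers):
    """Count how many records each reducer would receive."""
    loads = [0] * num_reducers
    for key in keys:
        loads[partition(key, table, num_reducers)] += 1
    return loads


if __name__ == "__main__":
    # A skewed sample: key "a" dominates.
    sample = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
    table = build_partition_table(sample, 2)
    print(table)
    print(reducer_loads(sample, table, 2))
```

In this toy sample the dominant key gets one reducer to itself while the three lighter keys share the other, so both reducers receive six records. A plain hash partitioner gives no such guarantee, because it ignores key frequencies.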

Keywords/Search Tags:

big data, MapReduce, sample, speculative execution, load balance

Related items

1	Research On Load Optimization Of MapReduce Resource Scheduling Mechanism In Heterogeneous Environments
2	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
3	MapReduce Speculation Execution Algorithm In Heterogeneous Environments
4	Key Value Based Algorithm For Solving Reduce Load Imbalance In MapReduce
5	Task Scheduling Optimization Based On Time And Load Balance Under The Hadoop Platform
6	Research On MapReduce Scheduler For Iterative Applications
7	Load Balancing Algorithm Based On Data Skew Of MapReduce
8	Research On Improving The Fault Tolerance Performance In MapReduce
9	Research On Scheduling Algorithm In Hadoop MapReduce
10	Research On Load Balance Of Hadoop Cloud Computing Platform