
Study on Performance Optimization of MapReduce

Posted on: 2017-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: G Wang
GTID: 2308330482490598
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Internet of Things, and the mobile Internet, massive amounts of data are produced every day. This explosive growth signals the arrival of the big data era. Data in this era is both massive and complex, which makes storage and computation more difficult. MapReduce, designed by Google, simplifies distributed computing and has attracted a lot of attention, so research on MapReduce and its optimization has practical significance.

MapReduce is a distributed computing model that simplifies data processing on large clusters, and it is widely used in the big data field. Hadoop is an open-source implementation of MapReduce capable of processing big data. However, some of its mechanisms limit performance. For example, the default partitioning scheme cannot guarantee load balance, which degrades performance. It is therefore necessary to optimize these mechanisms.

The main work of this thesis is as follows. First, we briefly introduce the Hadoop platform and then turn to the MapReduce computing model, analyzing its key components and internal operating mechanisms. Based on an analysis of the source code, this thesis identifies two problems: load imbalance when the data is skewed, and the low efficiency of the default speculative execution in heterogeneous environments. For skewed data, this thesis designs a new partitioning method that uses sampling to learn the distribution of the intermediate results and takes data locality into account, so that the load is balanced. For the inefficiency of speculative execution, this thesis proposes an improved LATE algorithm. Building on the LATE algorithm, it uses the historical information of nodes and data locality to identify stragglers more accurately and increase throughput.

Finally, we build an experimental platform to evaluate the sampling-based partitioning method and the improved LATE algorithm. The experimental results show that the proposed methods effectively improve the performance of MapReduce.
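
The idea of learning the key distribution by sampling and then balancing reducer load can be illustrated with a minimal Python sketch. This is not the thesis's method: the function names (`sample_key_counts`, `build_partition_table`, `partition`), the greedy largest-first assignment, and the CRC32 fallback for unsampled keys are all assumptions made for illustration, and data locality, which the thesis takes into account, is ignored here.

```python
import random
import zlib
from collections import Counter


def sample_key_counts(records, sample_rate, seed=0):
    """Estimate key frequencies from a random sample of (key, value) records."""
    rng = random.Random(seed)
    return Counter(key for key, _ in records if rng.random() < sample_rate)


def build_partition_table(key_counts, num_reducers):
    """Greedily assign keys to reducers, heaviest key first.

    Each key goes to the reducer with the smallest estimated load so far.
    This is a hypothetical balancing rule, not the one from the thesis.
    """
    loads = [0] * num_reducers
    table = {}
    ordered = sorted(key_counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, count in ordered:
        target = loads.index(min(loads))  # least-loaded reducer; ties go to the lowest index
        table[key] = target
        loads[target] += count
    return table, loads


def partition(key, table, num_reducers):
    """Return the reducer index for a key.

    Keys seen in the sample use the table; unseen keys fall back to a
    deterministic hash, similar in spirit to default hash partitioning.
    """
    if key in table:
        return table[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers
```

With skewed counts such as `{"a": 50, "b": 30, "c": 20, "d": 10}` and two reducers, the table keeps the heavy key `"a"` apart from `"b"` and `"c"`, whereas plain hash partitioning could place several heavy keys on the same reducer.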
Keywords/Search Tags: big data, MapReduce, sample, speculative execution, load balance