
The Optimization of a High-Performance MapReduce FairScheduler and Its Implementation on a Huge-Scale Cluster Simulator

Posted on: 2013-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: X M Pan
Full Text: PDF
GTID: 2218330371458921
Subject: Computer technology and applications

Abstract/Summary:
The Internet today faces the difficult problem of storing and computing over PB-scale data. Hadoop, an easily scalable distributed computing framework that links cheap PC nodes together to provide storage and computation services, is increasingly used for the distributed storage and processing of massive data. Its MapReduce framework offers users a simple programming model for the parallel processing of large-scale data. Against this background, this thesis analyzes in depth the working principles and mechanisms of MapReduce and of Hadoop's master-slave architecture.

Facebook designed and implemented the fair scheduler for MapReduce based on the characteristics of its cluster: large scale, a high proportion of small jobs, and the need for fast job response. In a large-scale cluster that mixes long-running batch jobs with short interactive jobs, however, the fair scheduler's performance degrades seriously, so its general applicability is limited. Starting from Facebook's fair scheduler, this thesis analyzes its performance bottlenecks and applies targeted optimizations: (1) lazy scheduling, (2) Shuffle independence, (3) multiple task assignment, and (4) out-of-band (OOB) heartbeat, among others. Together these address poor data locality and the slow utilization of reduce slots, improving both the scheduler's response time and its throughput.

Modeled on a real online production cluster, this thesis then designs and implements a simulator of a huge-scale Hadoop cluster and verifies its functionality and performance. On the simulator, a 2000-node cluster is used to compare FIFO, the original FairScheduler, and the new fair scheduler. The results show that the new fair scheduler lets jobs compete more fairly in a complex huge-scale cluster: overall cluster throughput improves by 25% on average and by up to 40%, and the average response time of a single job improves by 5%~25% compared with the scheduler before optimization.
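To make the first and third optimizations more concrete, below is a minimal Java sketch of one plausible reading of "lazy scheduling" (a job may skip a few heartbeats waiting for a data-local slot rather than immediately run a non-local task) combined with "multiple task assignment" (handing out several tasks per TaskTracker heartbeat instead of one). All class and method names here are hypothetical illustrations for this abstract, not the thesis's actual code or Hadoop's scheduler API.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: lazy (locality-delay) scheduling plus
 * multiple task assignment per heartbeat, in the spirit of the
 * optimizations summarized above.
 */
public class LazyFairSchedulerSketch {
    /** How many heartbeats a job may skip while waiting for a local slot (assumed value). */
    private static final int MAX_LOCALITY_DELAY = 3;
    /** Upper bound on tasks handed out in a single heartbeat (assumed value). */
    private static final int MAX_TASKS_PER_HEARTBEAT = 4;

    /** Hypothetical job handle tracking how long it has waited for locality. */
    static class Job {
        int skippedHeartbeats = 0;
        boolean hasLocalTaskFor(String node) { return false; } // placeholder
        Task obtainLocalTask(String node)    { return new Task(); } // placeholder
        Task obtainAnyTask()                 { return new Task(); } // placeholder
        boolean hasPendingTasks()            { return true; } // placeholder
    }
    static class Task {}

    /**
     * Called once per TaskTracker heartbeat. Rather than assigning at most
     * one task per heartbeat, it fills as many free slots as allowed,
     * preferring data-local tasks; a job without a local task "lazily"
     * skips a few heartbeats before accepting a non-local one.
     */
    List<Task> assignTasks(String node, int freeSlots, List<Job> jobsInFairShareOrder) {
        List<Task> assigned = new ArrayList<>();
        int limit = Math.min(freeSlots, MAX_TASKS_PER_HEARTBEAT);
        for (Job job : jobsInFairShareOrder) {
            while (assigned.size() < limit && job.hasPendingTasks()) {
                if (job.hasLocalTaskFor(node)) {
                    job.skippedHeartbeats = 0;
                    assigned.add(job.obtainLocalTask(node));
                } else if (job.skippedHeartbeats >= MAX_LOCALITY_DELAY) {
                    // Waited long enough: accept a non-local task.
                    job.skippedHeartbeats = 0;
                    assigned.add(job.obtainAnyTask());
                } else {
                    // Lazy scheduling: skip this heartbeat and hope a
                    // data-local slot frees up before the delay expires.
                    job.skippedHeartbeats++;
                    break;
                }
            }
        }
        return assigned;
    }
}
```

The design trade-off this sketch illustrates is the one the abstract claims the thesis tunes: a short locality delay recovers data locality (helping throughput), while the per-heartbeat batch assignment and OOB heartbeats fill idle slots faster (helping response time).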
Keywords/Search Tags: Distributed Computing, MapReduce, Hadoop, Shuffle Independence, Simulator, Parallel Submittor, huge cluster, lazy scheduling