
The Optimization of a High-Performance MapReduce FairScheduler and Its Implementation on a Huge-Scale Cluster Simulator

Posted on: 2013-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: X M Pan
Full Text: PDF
GTID: 2218330371458921
Subject: Computer technology and applications

Abstract/Summary:
The Internet today faces the difficult problem of storing and computing over PB-scale data. Hadoop, an easily scalable distributed computing framework that links cheap PC nodes together to provide storage and computation services, is increasingly used for the distributed storage and processing of massive data. Its MapReduce framework offers users a simple programming model for the parallel processing of large-scale data. Against this background, this thesis analyzes in depth the working principles and mechanisms of MapReduce and of Hadoop's master-slave architecture.

Facebook designed and implemented the fair scheduler for MapReduce based on the characteristics of its cluster: large scale, a high proportion of small jobs, and the need for fast job response. In a large-scale cluster that mixes long-running batch jobs with short interactive jobs, however, the fair scheduler's performance degrades seriously, so its general applicability is limited. Starting from Facebook's fair scheduler, this thesis analyzes its performance bottlenecks and applies targeted optimizations: (1) lazy scheduling, (2) Shuffle independence, (3) multiple task assignment, and (4) out-of-band (OOB) heartbeat, among others. Together these address poor data locality and the slow utilization of reduce slots, improving both the scheduler's response time and its throughput.

Modeled on a real online production cluster, this thesis then designs and implements a simulator of a huge-scale Hadoop cluster and verifies its functionality and performance. On the simulator, a 2000-node cluster is used to compare FIFO, the original FairScheduler, and the new fair scheduler. The results show that the new fair scheduler lets jobs compete more fairly in a complex huge-scale cluster: overall cluster throughput improves by 25% on average and by up to 40%, and the average response time of a single job improves by 5%~25% compared with the scheduler before optimization.
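To make the first and third optimizations more concrete, below is a minimal Java sketch of one plausible reading of "lazy scheduling" (a job may skip a few heartbeats waiting for a data-local slot rather than immediately run a non-local task) combined with "multiple task assignment" (handing out several tasks per TaskTracker heartbeat instead of one). All class and method names here are hypothetical illustrations for this abstract, not the thesis's actual code or Hadoop's scheduler API.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: lazy (locality-delay) scheduling plus
 * multiple task assignment per heartbeat, in the spirit of the
 * optimizations summarized above.
 */
public class LazyFairSchedulerSketch {
    /** How many heartbeats a job may skip while waiting for a local slot (assumed value). */
    private static final int MAX_LOCALITY_DELAY = 3;
    /** Upper bound on tasks handed out in a single heartbeat (assumed value). */
    private static final int MAX_TASKS_PER_HEARTBEAT = 4;

    /** Hypothetical job handle tracking how long it has waited for locality. */
    static class Job {
        int skippedHeartbeats = 0;
        boolean hasLocalTaskFor(String node) { return false; } // placeholder
        Task obtainLocalTask(String node)    { return new Task(); } // placeholder
        Task obtainAnyTask()                 { return new Task(); } // placeholder
        boolean hasPendingTasks()            { return true; } // placeholder
    }
    static class Task {}

    /**
     * Called once per TaskTracker heartbeat. Rather than assigning at most
     * one task per heartbeat, it fills as many free slots as allowed,
     * preferring data-local tasks; a job without a local task "lazily"
     * skips a few heartbeats before accepting a non-local one.
     */
    List<Task> assignTasks(String node, int freeSlots, List<Job> jobsInFairShareOrder) {
        List<Task> assigned = new ArrayList<>();
        int limit = Math.min(freeSlots, MAX_TASKS_PER_HEARTBEAT);
        for (Job job : jobsInFairShareOrder) {
            while (assigned.size() < limit && job.hasPendingTasks()) {
                if (job.hasLocalTaskFor(node)) {
                    job.skippedHeartbeats = 0;
                    assigned.add(job.obtainLocalTask(node));
                } else if (job.skippedHeartbeats >= MAX_LOCALITY_DELAY) {
                    // Waited long enough: accept a non-local task.
                    job.skippedHeartbeats = 0;
                    assigned.add(job.obtainAnyTask());
                } else {
                    // Lazy scheduling: skip this heartbeat and hope a
                    // data-local slot frees up before the delay expires.
                    job.skippedHeartbeats++;
                    break;
                }
            }
        }
        return assigned;
    }
}
```

The design trade-off this sketch illustrates is the one the abstract claims the thesis tunes: a short locality delay recovers data locality (helping throughput), while the per-heartbeat batch assignment and OOB heartbeats fill idle slots faster (helping response time).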
Keywords/Search Tags: Distributed Computing, MapReduce, Hadoop, Shuffle Independence, Simulator, Parallel Submittor, huge cluster, lazy scheduling