Research On Data-Aware Scheduling Strategies Of MapReduce Jobs

Posted on:2013-02-12

Degree:Master

Type:Thesis

Country:China

Candidate:Q W Fu

Full Text:PDF

GTID:2248330371483052

Subject:Computer system architecture

Abstract/Summary:

With the development of computer technology and the spread of information technology, every industry now produces vast amounts of data each day, and the volume is growing explosively. According to IDC statistics, more than 270,000 PB of new data was generated worldwide per year (as of 2010). In recent years, large-scale data processing has become a focus of both the computer industry and academia. Traditional high-performance computing platforms are not well suited to data-intensive jobs over such huge volumes of data; handling these jobs requires computing platforms with good scalability, availability, and fault tolerance.

At present, the MapReduce distributed processing model and the GFS distributed file system, both published by Google Inc., provide the tools for handling data-intensive jobs. Hadoop, the widely used open-source implementation of the MapReduce model and GFS, has attracted attention not only in industry but also in academia. A Hadoop cluster scales well horizontally, and its compute nodes can be ordinary machines, which greatly reduces the hardware cost of setting up a cluster. Hadoop also offers good fault tolerance and availability. The Hadoop platform lets more people easily build a large-scale data processing platform to analyze their data, and it has promoted the development of large-scale data processing technology.

As a core part of the computing platform, the job scheduler plays a decisive role in allocating the computing resources of the entire platform.
For our research on MapReduce job scheduling algorithms and schedulers, we chose the Hadoop platform. The schedulers commonly used with Hadoop today are the default FIFO scheduler, the multi-user fair scheduler (FairShare), the Capacity scheduler for multiple queues and multiple users, and schedulers for particular scenarios (e.g., scheduling against a job's latest completion time). Hadoop has many types of scheduler, but few of them focus on improving job execution efficiency. Hadoop is mainly used for data-intensive jobs, and the system's computing resources and data are stored together, so the main way to improve job execution efficiency is to minimize data transmission within the system by running each task on the node that holds its data. The main work of this thesis is a data-aware scheduling policy for the Hadoop platform, built on a delay scheduling algorithm driven by resource forecasts; this algorithm can effectively improve the execution efficiency of Hadoop jobs.

When scheduling a job, one option is to transfer the data to the node that runs the task; the other is to assign the task to the node that holds the data. MapReduce jobs mainly process massive data, so the first option inevitably involves a large amount of data transmission and wastes computing resources. Given this characteristic of MapReduce jobs, improving their efficiency means moving the computation rather than the data. Assigning a computing task to a compute node that holds the data to be processed is known as localized computation, or task locality.

In this thesis, the main work is a data-aware scheduling policy for MapReduce jobs on the Hadoop platform.
Combining the FIFO scheduling algorithm with the delay scheduling algorithm of the FairShare scheduler, this thesis proposes a delay algorithm based on resource prediction. By collecting statistics on job processing and system state in real time, the delay strategy dynamically forecasts the resources that will become available in the system and uses the forecast as the basis for job scheduling. This both increases the proportion of Map tasks that run locally and reduces the computing resources wasted when jobs wait unreasonably. The delay strategy based on resource forecasts is more reasonable than the delay strategy of the FairShare scheduler, and the scheduling algorithm can effectively improve job execution efficiency. Experiments show that, in a general scenario, the scheduling algorithm in this thesis improves the average execution efficiency of jobs by 28.8% compared with the FairShare scheduler. On this basis, by combining this scheduling policy with a latest completion time, we implemented a Deadline scheduler: the user sets the latest finish time, and the scheduler not only ensures that the job completes before the deadline but also improves the average execution efficiency of jobs.
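
The abstract does not spell out the algorithm itself. As a rough illustration only, the Python sketch below shows one way a forecast-driven delay decision could look: when a slot frees up on a node, a task whose data is on that node is launched at once; otherwise a task is launched remotely only if the forecast says a local slot will not free up soon. All names (`MapTask`, `ClusterView`, `pick_task`, `remote_penalty`) and the decision rule are assumptions made for this example, not the thesis's implementation or any Hadoop API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class MapTask:
    task_id: str
    replica_nodes: Set[str]  # nodes holding a replica of the task's input split


@dataclass
class ClusterView:
    # Hypothetical forecast: node -> estimated seconds until each busy slot frees up.
    remaining: Dict[str, List[float]] = field(default_factory=dict)

    def predicted_wait(self, nodes: Set[str]) -> float:
        """Earliest forecast time at which any of `nodes` frees a slot.

        Nodes without an estimate are ignored; if none has one, return infinity.
        """
        waits = [min(self.remaining[n]) for n in nodes if self.remaining.get(n)]
        return min(waits) if waits else float("inf")


def pick_task(
    free_node: str,
    pending: List[MapTask],
    cluster: ClusterView,
    remote_penalty: float,
) -> Optional[MapTask]:
    """Choose a task for the slot that just freed up on `free_node`.

    `remote_penalty` is an assumed estimate (in seconds) of the extra cost of
    running a map task away from its data. Returning None means "delay":
    leave the slot to another job and wait for a local slot.
    """
    if not pending:
        return None

    # 1. A data-local task always wins.
    for task in pending:
        if free_node in task.replica_nodes:
            return task

    # 2. No local task: find the task that would wait longest for a local slot.
    worst_wait, task = max(
        ((cluster.predicted_wait(t.replica_nodes), t) for t in pending),
        key=lambda pair: pair[0],
    )

    # 3. Run it remotely only if waiting is forecast to cost more than moving data.
    return task if worst_wait > remote_penalty else None
```

In this sketch the forecast replaces a fixed waiting threshold: a job gives up locality only when the predicted wait for a local slot exceeds the assumed cost of a remote run, which is one plausible reading of "reducing unreasonable waiting".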

Keywords/Search Tags:

MapReduce job scheduling, Hadoop job scheduler, data-aware scheduling algorithm, resource forecasting, resource scheduling

Related items

1	Study On Resource Context And Job Cost-Aware Job Scheduling Optimization For Hadoop Mapreduce Framework
2	Thermal Aware Scheduling in Hadoop MapReduce Framework
3	Research On Resource-aware Skew Mitigation For Mapreduce
4	Research And Improvement Of Resource Scheduler Algorithm Based On Hadoop
5	Researches On Optimization Of Resource Allocation For MapReduce Scheduling
6	The Research On Distributed Task Scheduling Algorithms Based On Hadoop Platform
7	Research Of Self-Learning Resources Scheduler Model Based On The Hadoop System
8	Design And Implementation Of Hadoop Resource-aware Scheduler
9	The Research Of Job Scheduling Algorithms Based On Resource-aware For Hadoop Platform
10	The Research Of MapReduce Job Scheduling Algorithm Based On The Hadoop Platform