Research On Task Scheduling Algorithms In MapReduce Clusters

Posted on: 2016-12-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X T Wang
Full Text: PDF
GTID: 1318330542987063
Subject: Computer software and theory
Abstract/Summary:
In recent years, Internet technology has developed rapidly, and a variety of network-based applications, such as e-mail and microblogs, have emerged and brought people great convenience. As hundreds of millions of users browse information via the Internet, a huge amount of data is generated continuously, such as electronic trading records and user access logs. In addition, many large enterprises and organizations generate massive amounts of data, for example stock information from stock exchanges and oceanic data from monitoring stations. How to process such large-scale data sets efficiently and extract valuable information from them has become a hot problem for many IT enterprises and scholars. MapReduce is a parallel computing framework designed for large-scale data analysis, and it has been widely recognized by computer scientists for its processing ability and high availability. Task scheduling is the core of MapReduce and directly affects its performance. Therefore, in this dissertation, we conduct in-depth research on the task scheduling problem in MapReduce and propose several efficient scheduling algorithms for a variety of application scenarios. The main contributions are summarized as follows.

(1) For MapReduce jobs with deadline requirements, a Scheduling Algorithm based on the Most Effective Sequence (SAMES) is proposed. 1) A Sequence-based Task Execution Strategy (STES) is proposed for deadline-constrained jobs. Based on STES, the concept of the Effective Sequence (ES) is introduced. 2) For the jobs in a MapReduce cluster, an efficient method is designed to generate the corresponding ES rapidly. If more than one ES exists, reasonable criteria are used to choose the Most Effective Sequence (MES) as the basis of scheduling. 3) When new jobs arrive, an incremental method is used to update the MES. 4) In large-scale clusters, some exceptions are unavoidable. Thus, an exception handling method is proposed to improve the robustness of SAMES. Finally, a series of experiments is used to verify the performance of SAMES for deadline-constrained jobs.

(2) We propose a scheduling Algorithm for the Maximum Benefit (AMB) in MapReduce. 1) We formulate the maximum benefit problem for the first time. Specifically, if a job is accepted by the system and finished on time, some benefit is acquired. In contrast, an overtime job causes financial losses. Therefore, given a job set, we need to accept suitable jobs and process them in a reasonable way in order to maximize the benefit. 2) To solve this problem, the AMB algorithm is proposed. First, for a static job set, AMB rapidly determines which jobs can be accepted and gives a reasonable scheduling plan. Second, for a dynamic job set, AMB uses an incremental method to judge acceptability and update the scheduling plan. Third, a timeout handling method is designed to reduce the provider's loss caused by exceptions. Finally, the effectiveness of AMB is verified through extensive experiments.

(3) For a multi-user shared MapReduce cluster, a Throughput Driven scheduling algorithm (TD) is proposed. 1) Based on the parameters of the jobs and the system resources, the jobs are classified into 6 states, and a method for state conversion is designed. 2) Through an in-depth analysis of the main factors that affect system throughput, methods for job selection and task assignment are proposed, which significantly increase the proportion of locally assigned tasks and reduce the network communication cost. Finally, the performance of the TD scheduler is evaluated through extensive experiments.

(4) We propose a Throughput Driven scheduling algorithm for a Heterogeneous MapReduce cluster (HTD). 1) For a single job in a heterogeneous cluster, we design the task assignment method and estimate the parameters of the job. 2) For the job set in the cluster, the optimal execution order is generated according to the parameters obtained in the first step. 3) The optimal execution order is further adjusted to suit the heterogeneous environment, and the final scheduling plan is then obtained. Finally, the validity of HTD is verified through a number of experiments.
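To make the maximum benefit problem in contribution (2) concrete, the following is a minimal sketch of job acceptance under deadlines. It is not the AMB algorithm, whose details are not given in this abstract. It swaps in a simple greedy rule (highest benefit first) with an earliest-deadline-first feasibility check, and it models the whole cluster as a single resource. The `Job` fields, the function names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    runtime: int   # assumed: total processing time on the (single) resource
    deadline: int  # assumed: absolute deadline, same time unit as runtime
    benefit: int   # gain if the job is accepted and finishes on time


def edf_feasible(jobs):
    """True if every job meets its deadline when the jobs run back to back
    in earliest-deadline-first order on one resource."""
    clock = 0
    for job in sorted(jobs, key=lambda j: j.deadline):
        clock += job.runtime
        if clock > job.deadline:
            return False
    return True


def greedy_accept(candidates):
    """Consider jobs in order of decreasing benefit and keep a job only if
    the accepted set stays feasible. Returns the plan in deadline order.
    This greedy rule is a stand-in and is not guaranteed to be optimal."""
    accepted = []
    for job in sorted(candidates, key=lambda j: j.benefit, reverse=True):
        if edf_feasible(accepted + [job]):
            accepted.append(job)
    return sorted(accepted, key=lambda j: j.deadline)


# Hypothetical static job set.
jobs = [
    Job("a", runtime=3, deadline=4, benefit=10),
    Job("b", runtime=2, deadline=4, benefit=6),
    Job("c", runtime=2, deadline=6, benefit=5),
]
plan = greedy_accept(jobs)
print([j.name for j in plan], sum(j.benefit for j in plan))
```

In this sample, job "b" is rejected because running it together with "a" would miss the shared deadline of 4, so no overtime loss is incurred. A fuller model would also weigh the loss from overtime jobs and handle newly arriving jobs incrementally, as the abstract describes for AMB.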
Keywords/Search Tags: MapReduce, task scheduling, deadline, the maximum benefit, throughput, heterogeneous cluster