Research on Optimization of MapReduce Job Scheduling Technology

Posted on: 2016-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: D Q Liang
GTID: 2348330503977884
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, data has penetrated every industry and business sector and has become an important factor of production. The amount of data generated on the Internet each day far exceeds the carrying capacity of existing IT infrastructure, and real-time processing requirements exceed existing computing power. As a cloud computing data processing system, Hadoop uses data-parallel computing to process large data sets and has been widely applied in many fields. However, most existing MapReduce job schedulers do not consider the deadline requirements of jobs, so some jobs cannot be completed in time. Moreover, most job schedulers use a "best effort" policy for the localized execution of jobs, so the job set cannot take full advantage of data locality, and network transmission cost becomes an efficiency bottleneck. In addition, most job schedulers do not consider cluster heterogeneity and cannot select suitable compute nodes for jobs according to local conditions, which results in inefficient job execution.

In response to these problems, this paper proposes MCF, a two-tier scheduling algorithm aimed at improving the execution efficiency of data-parallel jobs. In the first-level scheduling, MCF establishes a multi-user waiting queue, pre-assigns resources (storage, compute, and bandwidth) to jobs based on their deadlines, and estimates the remaining time of jobs, which minimizes the average delay time and provides a basis for finer-grained task assignment. In the second-level scheduling, MCF combines tasks into task groups based on the location information of data blocks to improve scheduling efficiency. MCF then establishes a waiting time model and an execution time model for jobs that account for data locality and cluster heterogeneity. Finally, MCF generates scheduling sequences for tasks using a strategy based on minimum cost flow, with the goal of reducing the average response time of the whole job set.

We designed and implemented the MCF scheduling algorithm on a high-performance computing center to achieve the above targets. The experimental results show that MCF effectively reduces the average response time of the job set, decreases the average delay time, and has certain performance advantages over the FIFO, Capacity, and Fair schedulers.
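
The abstract does not give the flow network or the cost model behind the minimum-cost-flow step, so the following is only a minimal sketch of how such an assignment could look, not the thesis's implementation. It assumes a simple bipartite network (source, task groups, compute nodes, sink), treats each edge cost as a hypothetical integer estimate of waiting time plus execution time, and solves the flow with a plain successive-shortest-path method. The names `MinCostFlow`, `assign`, the task groups `g0` to `g2`, the nodes `A` and `B`, and all cost values are illustrative assumptions.

```python
from collections import deque


class MinCostFlow:
    """Successive shortest paths with a queue-based Bellman-Ford search."""

    def __init__(self, n):
        self.n = n
        # Each edge is a list: [target, residual capacity, cost, index of reverse edge]
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            dist = [float("inf")] * self.n
            dist[s] = 0
            prev = [None] * self.n          # (previous vertex, edge index)
            in_queue = [False] * self.n
            queue = deque([s])
            while queue:
                u = queue.popleft()
                in_queue[u] = False
                for i, (v, cap, w, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + w < dist[v]:
                        dist[v] = dist[u] + w
                        prev[v] = (u, i)
                        if not in_queue[v]:
                            in_queue[v] = True
                            queue.append(v)
            if prev[t] is None:             # no augmenting path is left
                return flow, cost
            # Find the bottleneck capacity along the shortest path.
            push, v = float("inf"), t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            # Augment along the path and update the reverse edges.
            v = t
            while v != s:
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]


def assign(task_groups, slots, cost):
    """Assign each task group to one node, minimizing the total estimated cost.

    slots: node name -> number of free slots (assumed capacity model)
    cost:  (task group, node) -> integer cost estimate (assumed cost model)
    """
    nodes = list(slots)
    s, t = 0, 1
    g_id = {g: 2 + i for i, g in enumerate(task_groups)}
    n_id = {n: 2 + len(task_groups) + j for j, n in enumerate(nodes)}
    node_of = {v: n for n, v in n_id.items()}
    mcf = MinCostFlow(2 + len(task_groups) + len(nodes))
    for g in task_groups:
        mcf.add_edge(s, g_id[g], 1, 0)
        for n in nodes:
            mcf.add_edge(g_id[g], n_id[n], 1, cost[g, n])
    for n in nodes:
        mcf.add_edge(n_id[n], t, slots[n], 0)
    flow, total = mcf.solve(s, t)
    # A saturated group-to-node edge (residual capacity 0) carries the assignment.
    plan = {g: node_of[v]
            for g in task_groups
            for v, cap, _, _ in mcf.graph[g_id[g]]
            if v in node_of and cap == 0}
    return flow, total, plan


# Hypothetical costs: lower values stand for local data or a faster node.
costs = {("g0", "A"): 2, ("g0", "B"): 6,
         ("g1", "A"): 3, ("g1", "B"): 4,
         ("g2", "A"): 5, ("g2", "B"): 9}
flow, total, plan = assign(["g0", "g1", "g2"], {"A": 2, "B": 1}, costs)
print(flow, total, plan)
```

With these made-up costs, the sketch places `g0` and `g2` on node `A` and `g1` on node `B`, for a total cost of 11. A greedy "best effort" placement would put both `g0` and `g1` on `A` and force `g2` onto `B` at cost 14, which shows why a global flow formulation can help when slots are scarce.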
Keywords/Search Tags: big data processing, MapReduce, job scheduling, minimum cost flow