
The Research On Distributed Task Scheduling Algorithms Based On Hadoop Platform

Posted on: 2013-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Zhou
Full Text: PDF
GTID: 2248330395984788
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, the Internet has become an indispensable part of people's lives. Against this background of explosive growth in network information, massive data processing has become a new challenge in computer science. MapReduce is a distributed data processing programming model; its advantage lies in simplifying traditional distributed program development: developers need only focus on the business logic, without considering the details of the distributed implementation. Hadoop is an open-source implementation of MapReduce, and it provides a foundational data processing platform for enterprises and research institutions that handle massive data. Research on MapReduce scheduling algorithms mainly addresses the problems of cluster sharing, resource utilization, and job response time. Meanwhile, as users' real-time requirements increase, research on real-time MapReduce scheduling is growing. The difficulty of real-time MapReduce scheduling lies in the real-time scheduling model, which must account for the cluster's heterogeneity and data locality. Predicting a task's remaining time is a major part of real-time scheduling, and this prediction is strongly affected by the cluster's heterogeneity.

By studying the job runtime mechanism of Hadoop, this thesis proposes a Self-Adaptive Reduce Scheduling (SARS) algorithm. In current research on MapReduce scheduling, the choice of when to launch Reduce tasks is overly simplistic, yet the Reduce launch time directly influences the job's completion time and the utilization of the cluster. The SARS algorithm decides when to start Reduce tasks based on the job's own properties. Experimental results show that SARS reduces the completion time of jobs' Reduce tasks and the mean response time of jobs in the cluster.
It also improves the utilization of cluster resources.

Given the heterogeneity of the cluster, this thesis proposes a node classification algorithm based on computing capacity, which groups the cluster's nodes by their distinct computing capacities. On top of this classification, it proposes a scheduling algorithm, MTSD (MapReduce Task Scheduling for Deadline constraints). MTSD includes a model for estimating a task's remaining time and derives a resource-requirement model for real-time job scheduling. Experimental results show that the MTSD algorithm improves data locality and also performs well in meeting jobs' real-time demands.
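The MapReduce programming model described above can be illustrated with a minimal in-memory sketch (this is a simulation of the model's map/shuffle/reduce phases, not the Hadoop API): the framework handles grouping and data movement, while the developer supplies only the business logic as a mapper and a reducer.

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the user-defined mapper to every input record,
    # emitting (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Apply the user-defined reducer to each key's group of values.
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the developer writes only the two functions below.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def word_reducer(word, counts):
    return sum(counts)

lines = ["hadoop mapreduce", "hadoop scheduling"]
result = reduce_phase(shuffle(map_phase(lines, word_mapper)), word_reducer)
# result == {"hadoop": 2, "mapreduce": 1, "scheduling": 1}
```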
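The SARS idea of deriving the Reduce launch time from the job's own properties, rather than from a fixed map-completion fraction, can be sketched as follows. All quantities here (per-job shuffle rate, total map output size, the slow-start fraction) are hypothetical inputs chosen for illustration; the thesis's actual model may differ.

```python
def reduce_start_time(map_finish_times, shuffle_rate, map_output_size,
                      slowstart=0.05):
    # Hypothetical sketch of the SARS idea: estimate the latest time a
    # Reduce task can start and still finish copying (shuffling) all map
    # output right as the last map task completes. Starting later than a
    # fixed early threshold keeps reduce slots free for other jobs.
    map_finish_times = sorted(map_finish_times)
    last_map_done = map_finish_times[-1]
    copy_duration = map_output_size / shuffle_rate  # time to pull all output
    # Never start before a small fraction of maps have finished.
    threshold_idx = max(0, int(len(map_finish_times) * slowstart) - 1)
    earliest = map_finish_times[threshold_idx]
    return max(earliest, last_map_done - copy_duration)

# Four map tasks finish at t=10..40; shuffling 40 units at rate 2 takes 20,
# so Reduce tasks need not start before t = 40 - 20 = 20.
start = reduce_start_time([10, 20, 30, 40], shuffle_rate=2, map_output_size=40)
# start == 20
```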
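The two components attributed to MTSD above, capacity-based node classification and a real-time resource-requirement model, can be sketched in a simplified form. The bucketing scheme and the slot formula below are illustrative assumptions, not the thesis's exact derivation.

```python
import math

def classify_nodes(node_speeds, num_levels=3):
    # Hypothetical sketch of capacity-based node classification: bucket
    # nodes into levels by measured computing capacity (e.g. tasks
    # completed per unit time), so the scheduler can keep a per-level
    # remaining-time estimate on a heterogeneous cluster.
    fastest, slowest = max(node_speeds.values()), min(node_speeds.values())
    span = (fastest - slowest) / num_levels or 1.0  # avoid zero-width buckets
    levels = {}
    for node, speed in node_speeds.items():
        level = min(num_levels - 1, int((speed - slowest) / span))
        levels.setdefault(level, []).append(node)
    return levels

def slots_needed(remaining_tasks, avg_task_time, deadline, now):
    # Minimum number of concurrent task slots required to finish all
    # remaining tasks before the deadline: total remaining work divided
    # by the time left, rounded up.
    time_left = deadline - now
    return math.ceil(remaining_tasks * avg_task_time / time_left)

levels = classify_nodes({"node-a": 1.0, "node-b": 5.0, "node-c": 10.0})
# levels == {0: ["node-a"], 1: ["node-b"], 2: ["node-c"]}
slots = slots_needed(remaining_tasks=10, avg_task_time=4, deadline=100, now=80)
# slots == 2
```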
Keywords/Search Tags: Large-scale processing, MapReduce, Hadoop, Scheduling algorithm, Reduce scheduling, Heterogeneity, Data locality