
The Research On Distributed Task Scheduling Algorithms Based On Hadoop Platform

Posted on: 2013-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Q Zhou
Full Text: PDF
GTID: 2248330395984788
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, the Internet has become an indispensable part of people's lives. Against this background of explosive growth in network information, massive data processing has become a new challenge in computer science. MapReduce is a distributed data processing programming model; its advantage lies in simplifying traditional distributed program development: developers need only focus on the business logic, without considering the details of the distributed implementation. Hadoop is an open-source implementation of MapReduce, and it provides a foundational data processing platform for enterprises and research institutions that handle massive data. Research on MapReduce scheduling algorithms mainly addresses the problems of cluster sharing, resource utilization, and job response time. Meanwhile, as users' real-time requirements increase, research on real-time MapReduce scheduling is growing. The difficulty of real-time MapReduce scheduling lies in the real-time scheduling model, which must account for the cluster's heterogeneity and data locality. Predicting a task's remaining time is a major part of real-time scheduling, and this prediction is strongly affected by the cluster's heterogeneity.

By studying the job runtime mechanism of Hadoop, this thesis proposes a Self-Adaptive Reduce Scheduling (SARS) algorithm. In current research on MapReduce scheduling, the choice of when to launch Reduce tasks is overly simplistic, yet the Reduce launch time directly influences the job's completion time and the utilization of the cluster. The SARS algorithm decides when to start Reduce tasks based on the job's own properties. Experimental results show that SARS reduces the completion time of jobs' Reduce tasks and the mean response time of jobs in the cluster.
It also improves the utilization of cluster resources.

Given the heterogeneity of the cluster, this thesis proposes a node classification algorithm based on computing capacity, which groups the cluster's nodes by their distinct computing capacities. On top of this classification, it proposes a scheduling algorithm, MTSD (MapReduce Task Scheduling for Deadline constraints). MTSD includes a model for estimating a task's remaining time and derives a resource-requirement model for real-time job scheduling. Experimental results show that the MTSD algorithm improves data locality and also performs well in meeting jobs' real-time demands.
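The MapReduce programming model described above can be illustrated with a minimal in-memory sketch (this is a simulation of the model's map/shuffle/reduce phases, not the Hadoop API): the framework handles grouping and data movement, while the developer supplies only the business logic as a mapper and a reducer.

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the user-defined mapper to every input record,
    # emitting (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Apply the user-defined reducer to each key's group of values.
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the developer writes only the two functions below.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def word_reducer(word, counts):
    return sum(counts)

lines = ["hadoop mapreduce", "hadoop scheduling"]
result = reduce_phase(shuffle(map_phase(lines, word_mapper)), word_reducer)
# result == {"hadoop": 2, "mapreduce": 1, "scheduling": 1}
```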
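The SARS idea of deriving the Reduce launch time from the job's own properties, rather than from a fixed map-completion fraction, can be sketched as follows. All quantities here (per-job shuffle rate, total map output size, the slow-start fraction) are hypothetical inputs chosen for illustration; the thesis's actual model may differ.

```python
def reduce_start_time(map_finish_times, shuffle_rate, map_output_size,
                      slowstart=0.05):
    # Hypothetical sketch of the SARS idea: estimate the latest time a
    # Reduce task can start and still finish copying (shuffling) all map
    # output right as the last map task completes. Starting later than a
    # fixed early threshold keeps reduce slots free for other jobs.
    map_finish_times = sorted(map_finish_times)
    last_map_done = map_finish_times[-1]
    copy_duration = map_output_size / shuffle_rate  # time to pull all output
    # Never start before a small fraction of maps have finished.
    threshold_idx = max(0, int(len(map_finish_times) * slowstart) - 1)
    earliest = map_finish_times[threshold_idx]
    return max(earliest, last_map_done - copy_duration)

# Four map tasks finish at t=10..40; shuffling 40 units at rate 2 takes 20,
# so Reduce tasks need not start before t = 40 - 20 = 20.
start = reduce_start_time([10, 20, 30, 40], shuffle_rate=2, map_output_size=40)
# start == 20
```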
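The two components attributed to MTSD above, capacity-based node classification and a real-time resource-requirement model, can be sketched in a simplified form. The bucketing scheme and the slot formula below are illustrative assumptions, not the thesis's exact derivation.

```python
import math

def classify_nodes(node_speeds, num_levels=3):
    # Hypothetical sketch of capacity-based node classification: bucket
    # nodes into levels by measured computing capacity (e.g. tasks
    # completed per unit time), so the scheduler can keep a per-level
    # remaining-time estimate on a heterogeneous cluster.
    fastest, slowest = max(node_speeds.values()), min(node_speeds.values())
    span = (fastest - slowest) / num_levels or 1.0  # avoid zero-width buckets
    levels = {}
    for node, speed in node_speeds.items():
        level = min(num_levels - 1, int((speed - slowest) / span))
        levels.setdefault(level, []).append(node)
    return levels

def slots_needed(remaining_tasks, avg_task_time, deadline, now):
    # Minimum number of concurrent task slots required to finish all
    # remaining tasks before the deadline: total remaining work divided
    # by the time left, rounded up.
    time_left = deadline - now
    return math.ceil(remaining_tasks * avg_task_time / time_left)

levels = classify_nodes({"node-a": 1.0, "node-b": 5.0, "node-c": 10.0})
# levels == {0: ["node-a"], 1: ["node-b"], 2: ["node-c"]}
slots = slots_needed(remaining_tasks=10, avg_task_time=4, deadline=100, now=80)
# slots == 2
```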
Keywords/Search Tags: Large-scale processing, MapReduce, Hadoop, Scheduling algorithm, Reduce scheduling, Heterogeneity, Data locality