Font Size: a A A

Research On Key Issues Of Task And Job Scheduling For MapReduce Clusters

Posted on:2015-04-15Degree:DoctorType:Dissertation
Country:ChinaCandidate:T GuFull Text:PDF
GTID:1228330467965584Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the development of information technologies, the requirements of data processing are growing quickly, which proposes high demands for techniques of data processing. By connecting lots of machines as nodes, the MapReduce cluster provides cheap and easy to use computing capabilities, which satisfies current demands of data processing. The MapReduce cluster schedules and executes tasks in parallel automatically, so problems related to scheduling have a great effect on its performance. Researches on these issues have important theoretical and practical significance. This dissertation focuses on key issues including analyzing the process of MapReduce in theoretical, the resource allocation for sequential jobs, the overhead of data transmission and problem of load balancing. The main research contents of this dissertation are listed as follows:First of all, the processes of tasks’ scheduling and execution in a MapReduce job are analyzed with Divisible Load Theory (DLT). An independent communication model with multi-source, which fits the practical situation in MapReduce clusters better, is proposed. By using the characteristics of map and reduce tasks, the data transmission between nodes and processes of computing task are analyzed. Then the MapReduce is modeled with DLT and proposed communication model, and the optimal distribution of input data and scheduling of MapReduce loads for minimum execution time are presented with linear programming. The task scheduling schemes are solved and the performance of task execution are evaluated in typical MapReduce clusters, which provides instructions for task scheduling and distribution of input data.Secondly, the issue of scheduling sequential MapReduce jobs is studied. Under the assumption that sequential MapReduce jobs’ arrivals according to a Poisson process, the computing process of sequential MapReduce jobs is analyzed. The relationship matching nodes’ computing capabilities and jobs’ computing cost is established. With the distribution function of jobs’ computing cost, the job scheduling scheme that minimizes jobs’average computing time is given theoretically. A new job scheduling algorithm called median scheduling is proposed. Simulation results show that the median scheduling algorithm decreases the jobs’average computing time effectively.Next, the research used to decrease overhead of data transmission is carried out. Based on analyzing the data transmission in map phase, the situations under different network loads are studied. For the overloaded network, the size of data that storage nodes should export during tasks’execution is analyzed. Then a charity scheduling method is proposed based on that. For the low-loaded network, the data transmission and task computing in a MapReduce job is analyzed. A data prefetching mechanism is proposed to hide the cost of data transmission sufficiently. The results of experiments show that the charity scheduling method improves the proportion of tasks with local input data, and the data prefetching mechanism decreases the time nodes used to get corresponding data.Finally, the load balancing in scheduling reduce tasks is researched. The problem of load balancing in heterogeneous cluster is presented and the heterogeneity of reduce tasks’granularities is analyzed. A model of nodes’communication capabilities is established based on characteristics of communications between nodes. Considering nodes computing and communication capabilities uniformly, the nodes’ capabilities for executing reduce tasks is got. Based on the idea that matching tasks with nodes’capabilities, a scheduling method for reduce tasks is proposed. The results of simulation experiments show that the proposed method achieves better load balancing and decreases execution time effectively. The proposed method is also verified to be adaptive to dynamic nodes’capabilities.This dissertation focuses on several key issues related to the scheduling in the MapReduce cluster. Aiming to analyze and optimize the performance of MapReduce, the task scheduling and sequential MapReduce jobs’scheduling are researched in theoretical. Several scheduling methods including the median scheduling algorithm, the charity scheduling method and a scheduling method for reduce tasks are proposed. Results of experiments show that researches in this dissertation improve the performance of MapReduce effectively. These researches provide a platform for the MapReduce’s analysis. Related conclusions and results can be referenced by future researches of MapReduce.
Keywords/Search Tags:MapReduce Cluster, Task and Job Scheduling, Sequential MapReduceJobs, Overhead of Data Transmission, Load Balancing
PDF Full Text Request
Related items