
Study On Computing Task Scheduling Optimization Based On Hadoop Job

Posted on: 2017-01-10 | Degree: Master | Type: Thesis
Country: China | Candidate: X Xiao | Full Text: PDF
GTID: 2308330485488111 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information science, the Internet has become increasingly intertwined with all aspects of society, and the data generated within it grows at an exponential rate. Faced with such massive data, the traditional computing model can hardly meet current data processing requirements. As a product of the intersection of the traditional computing model and network technology, cloud computing can process these massive data efficiently across many distributed machines. Hadoop, a distributed computing framework that handles such large-scale datasets efficiently, has been adopted more and more frequently by institutions as the basic computing framework of their cloud computing platforms. Because improving Hadoop’s execution efficiency has become a hot research topic, improving the scheduling algorithm, a critical factor affecting execution efficiency, is of high significance.
A review of existing optimizations of Hadoop’s scheduling algorithm shows that most focus on how to schedule reasonably among multiple jobs; few address the scheduling of computing tasks within a Hadoop job. In addition, the computing capacity of heterogeneous cluster nodes is either not fully considered or is set to a theoretical value derived only from the machine configuration, and therefore becomes disconnected from reality. This thesis addresses the problem of computing task scheduling within a Hadoop job, and its main work consists of the following two parts.
First, we introduce the background of this subject and the Hadoop components involved in the scheduling process. We also analyse the disadvantages of Hadoop’s default task scheduling algorithm and the functions of the related classes and methods in the task scheduling process.
After analyzing the main ideas, design, advantages, and disadvantages of several existing improved scheduling algorithms, this thesis proposes a data localization task scheduling algorithm for Hadoop. The algorithm calculates the saturation level of each node’s data localization and schedules computing tasks according to the node’s real computing performance and the number of data blocks stored on it that have not yet been processed. In traditional task scheduling, the data blocks stored on a node are not distinguished from one another, and one block is selected at random at each scheduling step. This thesis introduces the concept of a data block label: every data block is marked during scheduling, and blocks are then scheduled according to the value of their labels. The proposed algorithm can improve the efficiency of computing task scheduling within a Hadoop job. Combined with other multi-job scheduling algorithms, it can further improve the efficiency of the Hadoop platform, and it performs well even in a heterogeneous cluster.
Second, we build a heterogeneous Hadoop cluster as the experimental environment, run the optimized task scheduling algorithm and the default task scheduling algorithm on it, and compare and analyse the experimental results. The results show that the optimized algorithm increases the number of data-local computing tasks. It thereby reduces network bandwidth usage, uses system resources more effectively, and shortens the overall job running time.
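The abstract does not define the saturation level or the block label, so the following Python sketch is only an illustration under stated assumptions, not the thesis's algorithm. It assumes that saturation is the estimated time a node needs to drain its unprocessed local blocks at its measured speed, and that a block's label is the number of nodes still holding an unprocessed replica of it. All names (`Node`, `saturation`, `block_labels`, `pick_block`, `mark_done`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    speed: float  # measured blocks processed per second (assumed metric)
    local_blocks: set = field(default_factory=set)  # unprocessed blocks stored here


def saturation(node):
    """Assumed definition: seconds of local work left on the node."""
    return len(node.local_blocks) / node.speed


def block_labels(nodes):
    """Assumed label: how many nodes still hold an unprocessed replica."""
    labels = {}
    for node in nodes:
        for block in node.local_blocks:
            labels[block] = labels.get(block, 0) + 1
    return labels


def pick_block(node, nodes):
    """Pick the local block with the smallest label instead of a random one.

    Ties are broken by block id so the choice is deterministic.
    """
    if not node.local_blocks:
        return None
    labels = block_labels(nodes)
    return min(node.local_blocks, key=lambda b: (labels[b], b))


def mark_done(block, nodes):
    """Remove a processed block from every node that stores a replica."""
    for node in nodes:
        node.local_blocks.discard(block)


fast = Node("fast", speed=2.0, local_blocks={"b1", "b2"})
slow = Node("slow", speed=1.0, local_blocks={"b2", "b3", "b4"})
cluster = [fast, slow]
print(saturation(fast), saturation(slow))
print(pick_block(fast, cluster), pick_block(slow, cluster))
```

Under these assumptions, choosing the least-replicated block first leaves widely replicated blocks available as local work for other nodes, and a node with low saturation is a natural candidate for the next task. Whether the thesis uses these exact definitions is not stated in the abstract.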
Keywords/Search Tags: Hadoop, data localization, task scheduling, heterogeneous cluster