Research And Improvement Of Job Scheduling Algorithms On Hadoop Platform

Posted on: 2016-04-04  Degree: Master  Type: Thesis
Country: China  Candidate: B R Cao  Full Text: PDF
GTID: 2298330467989522  Subject: Computer software and theory
Abstract/Summary:
With the development of search engines, social networks and other data-intensive Internet applications, the volume of information and data is growing explosively. This raises the question of how to store and process ever larger amounts of data, and cloud computing is one answer. As the commercial implementation of grid computing and distributed computing, cloud computing integrates multiple computing entities into a relatively powerful cluster through the network. Its features include very large scale, high reliability and high scalability. Its core idea is to form a resource pool through unified scheduling and allocation of resources, which are then allocated to each user according to need.

Among the many cloud computing solutions, Hadoop is a particularly important cloud platform architecture. It is an open source implementation of the GFS data storage model and the MapReduce distributed programming model, both proposed by Google, and it builds inexpensive clusters on top of a distributed file system. Its core technologies, HDFS and MapReduce, provide mass data storage and mass data processing respectively. How to improve the performance of MapReduce by designing different job scheduling algorithms has become a hot topic in academia and industry. This thesis mainly studies the delay scheduling algorithm and the LATE scheduling algorithm, and improves both.

The delay scheduling algorithm was proposed to solve the data locality problem. Its core idea is that when an idle node requests work, the scheduler gives priority to a job in the job queue that has data to be processed on that node; if no local job is found within the prescribed period of time, the job at the head of the queue is selected. This method greatly increases the probability that a job runs locally, but it also introduces some problems.
If the data to be processed by one job is concentrated on certain nodes, the load on those nodes increases greatly, which easily causes cluster load imbalance and reduces execution efficiency. This thesis improves the delay scheduling algorithm in two ways. The first is load balancing: while a job is waiting for a local target node, the load of the idle nodes is checked, and if a node's load exceeds the threshold, no task is assigned to it for the time being. The second is to increase the number of replicas of hot data: different data blocks are accessed with different frequency, so the number of replicas is set according to how hot each data block is. Together, these two methods improve the operating efficiency of the Hadoop cluster.

The LATE scheduling algorithm was proposed for heterogeneous clusters, in which nodes execute at different speeds. It lets fast nodes run work in place of slow nodes, so as to shorten the overall running time of the cluster. Because the original LATE scheduling algorithm does not consider the difference between local and non-local racks, or the cluster load during operation, this thesis improves it in both respects.
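The two delay scheduling changes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the `Job` class, the function names, and every default value (`max_delay`, `load_threshold`, `hot_step`, `max_replicas`) are hypothetical, and the linear replica rule is one simple choice among many.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    local_nodes: set     # nodes that hold input blocks for this job
    waited: float = 0.0  # seconds the head job has been passed over so far


def pick_job(node, node_load, queue, max_delay=5.0, load_threshold=0.8, tick=1.0):
    """Choose a job for an idle node, or return None to leave it idle for now."""
    # Improvement 1 (assumed form): an overloaded node receives no task.
    if node_load > load_threshold:
        return None
    # Delay scheduling: prefer any queued job with data on this node.
    for job in queue:
        if node in job.local_nodes:
            return job
    # No local job: the head of the queue runs non-locally only after
    # it has waited for the prescribed period.
    if queue:
        head = queue[0]
        if head.waited >= max_delay:
            return head
        head.waited += tick
    return None


def replica_count(access_count, base=3, hot_step=100, max_replicas=10):
    """Improvement 2 (assumed form): one extra replica per `hot_step` accesses."""
    return min(max_replicas, base + access_count // hot_step)
```

For example, with a queue of `Job("a", {"n1"})` and `Job("b", {"n2"})`, an idle node `n2` under the load threshold receives job `b`, while the same node above the threshold receives nothing.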
When estimating the remaining execution time, the improved algorithm takes into account the time cost of migrating data from the local rack to a non-local rack. Candidates are sorted by a weight, which is the sum of the remaining time and the migration time, and the load of the cluster nodes is also considered, so that tasks are not assigned to a node in an overloaded state. This reduces job running time and improves the running efficiency of the cluster.

Experimental verification shows that the improved delay scheduling algorithm shortens the average job response time compared with the original algorithm and improves the efficiency of the cluster. The improved LATE scheduling algorithm is more accurate than the original in identifying straggling tasks and more reasonable in speculating on jobs that cross racks, which improves the utilization rate of the cluster.
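The weighting used in the improved LATE algorithm can be illustrated with a short Python sketch. The remaining-time estimate follows the usual LATE heuristic (remaining work divided by the observed progress rate). Everything else is an assumption made for illustration: the `Task` fields, the cross-rack bandwidth figure, the load threshold, and the choice to rank the largest weight first, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    progress: float  # fraction of work done, in (0, 1]
    elapsed: float   # seconds the task has been running
    data_mb: float   # input data that a backup copy would need
    off_rack: bool   # True if the backup would run on a non-local rack


def remaining_time(task):
    """LATE-style estimate: work left divided by the observed progress rate."""
    rate = task.progress / task.elapsed
    return (1.0 - task.progress) / rate


def migration_time(task, cross_rack_mb_per_s=50.0):
    """Assumed cost model: data size over a fixed cross-rack bandwidth."""
    return task.data_mb / cross_rack_mb_per_s if task.off_rack else 0.0


def weight(task):
    """Weight from the abstract: remaining time plus migration time."""
    return remaining_time(task) + migration_time(task)


def rank_tasks(tasks, node_load, load_threshold=0.8):
    """Rank speculation candidates for a node; an overloaded node gets none."""
    if node_load > load_threshold:
        return []
    # Ordering direction is an assumption: largest weight first.
    return sorted(tasks, key=weight, reverse=True)
```

A task that is half done after 10 seconds has an estimated 10 seconds remaining; a task that is 80% done after 8 seconds has about 2 seconds remaining, but moving 500 MB across racks at the assumed 50 MB/s adds 10 seconds to its weight.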
Keywords/Search Tags: Hadoop, MapReduce, Delay scheduling, LATE scheduling algorithm, Locality, Load balance