Research On Scheduling Algorithms In Hadoop Clusters

Posted on: 2013-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Ke
Full Text: PDF
GTID: 2298330422479932
Subject: Computer Science and Technology
Abstract/Summary:
With the development of networks and applications, cloud computing has become one of the most active research areas. Hadoop is a computational framework for cloud computing that is well suited to processing large data sets. In Hadoop, jobs submitted by users are divided into a number of independent tasks, which the scheduler then assigns to different compute nodes for parallel execution.

Scheduling has always been one of the most important problems in parallel computing. As application environments grow more complex, traditional scheduling algorithms are no longer suitable for Hadoop. The Fair Scheduling algorithm addresses how to share cluster resources properly in a multi-user environment and has been adopted in Hadoop. However, cloud computing based on the Hadoop framework faces a data migration problem: when computing resources and data resources are at different physical locations, data must be moved across the network, which increases network I/O. This is the so-called data locality problem. To address it, researchers have proposed the Delay Scheduling algorithm, in which a task to be executed is delayed for a period of time until a node that holds the data required by the task asks the job scheduler for a task.

This paper first investigates the Delay Scheduling algorithm. Then, after analyzing its disadvantages, it puts forward two improvements and verifies them experimentally. The details are as follows:

1) Research on how to set a reasonable delay time interval for Delay Scheduling. In practice, the delay time interval is often an empirical value, and setting it too long or too short may hurt system performance and execution efficiency. Based on an analysis of how the distribution of the data to be processed in the file system affects locality-aware job scheduling, the paper introduces a user-expected localization probability and derives a formula for the waiting time. With this formula, a different waiting time is set for each job, so users can control the expected degree of localization through the desired probability parameter. An experiment verifies the method, and the results show that the delay time calculated by the formula allows jobs to reach the localization level that users expect.

2) Research on how to choose the target computing node properly. Because the Delay Scheduling algorithm is designed to solve the data locality problem, the target computing node of a task is determined by the location of its data. However, when data is concentrated on certain nodes, multiple tasks may run on the same node, resulting in poor parallelization. To solve this problem, the paper proposes the Delay-Capacity Scheduler algorithm, built on the Delay Scheduling algorithm. It allows some tasks to run on nodes that do not contain their input data, so as to decrease job response time and improve the degree of job parallelization. To implement the algorithm, the Hadoop source code was modified and recompiled to build the test environment. An experiment was carried out, and the results show that the improved algorithm clearly outperforms the original Delay Scheduling algorithm in both efficiency and parallelization.

3) Application of the above two algorithms to parallel formula calculation in an Electric Network Monitoring System. The paper first analyzes the characteristics of formula calculation in the Electric Network Monitoring System, then discusses how to parallelize formula calculation with the MapReduce programming model and how to schedule tasks using the two algorithms. Finally, it compares formula calculation in a Hadoop cluster environment with that in a traditional cluster environment; the results show that completion time and load balance are better with the method proposed in this paper.
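
The abstract does not reproduce the thesis's waiting-time formula or its scheduler code, so the following Python sketch is only an illustration of the general idea, not the author's method. It assumes the delay is counted in skipped scheduling opportunities rather than wall-clock time, and that each opportunity independently comes from a node holding the job's data with probability `p_local`; under that assumption the chance of finding a local node within D opportunities is 1 - (1 - p_local)^D. The names `skips_needed`, `Job`, and `assign` are hypothetical and do not correspond to Hadoop APIs.

```python
import math
from dataclasses import dataclass


def skips_needed(p_local: float, target: float) -> int:
    """Smallest D such that 1 - (1 - p_local)**D >= target.

    Assumes independent scheduling opportunities, each local with
    probability p_local (an illustrative model, not the thesis formula).
    """
    if not 0.0 < p_local <= 1.0:
        raise ValueError("p_local must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    if target == 0.0:
        return 0
    if p_local == 1.0:
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_local))


@dataclass
class Job:
    name: str
    pending: dict   # task id -> set of node names holding its input data
    max_skips: int  # per-job delay bound, e.g. from skips_needed()
    skipped: int = 0


def assign(jobs, node):
    """Pick a task for a node with a free slot.

    `jobs` is assumed to be ordered by fair-share priority already.
    Returns (job name, task id, is_local) or None if every job waits.
    """
    for job in jobs:
        if not job.pending:
            continue
        local = [t for t, nodes in job.pending.items() if node in nodes]
        if local:
            task = local[0]
            del job.pending[task]
            job.skipped = 0           # a local launch resets the delay
            return job.name, task, True
        if job.skipped >= job.max_skips:
            task = next(iter(job.pending))
            del job.pending[task]     # waited long enough: run non-locally
            return job.name, task, False
        job.skipped += 1              # skip this job and try the next one
    return None
```

For example, under the stated model a job whose data sits on 10% of nodes and whose user asks for 90% locality would get `skips_needed(0.1, 0.9)`, which is 22 skipped opportunities, as its `max_skips`. Setting the bound per job in this way mirrors the abstract's point that a single empirical delay value fits some jobs poorly.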
Keywords/Search Tags: Abstracts, Delay Scheduling, delay time interval, target computing node