Design of MapReduce Task Scheduling Algorithms in Heterogeneous Hadoop Cluster

Posted on: 2019-06-24 | Degree: Master | Type: Thesis
Country: China | Candidate: M Wang | Full Text: PDF
GTID: 2428330545460437 | Subject: Computer application technology
Abstract/Summary:
The rapid development of Internet applications has led to an era of information explosion. Most of this data is non-relational, in either unstructured or semi-structured format, and the volume generated even on a daily basis far exceeds the storage capacity and processing capability of any conventional stand-alone computer. Distributed computing coupled with cloud computing has proved to be an effective solution. In the past decade, several distributed computing frameworks have been developed and used for big data processing. MapReduce, which runs on the Hadoop platform with distributed storage and parallel processing, is one of the most popular. As data volumes continue to grow, improving the performance of Hadoop clusters has become a focus of many big data applications. Many factors affect Hadoop performance, and the MapReduce scheduler is a key component that determines the overall performance of a Hadoop cluster.

In this thesis, we formulate and investigate a task scheduling problem in a heterogeneous Hadoop cluster, with the goal of minimizing the completion time of a batch of MapReduce jobs. We first design a model that predicts the end time of a task; the prediction is used to place the corresponding data block on a node in advance, reducing data transmission time and overall job completion time. Based on this model, we propose a heuristic MapReduce task matching-based scheduling algorithm, referred to as TMSA, which schedules the tasks in the Hadoop task queue by taking into account the real-time performance of each node in the cluster and the matching degree between nodes and tasks, in order to reduce job completion time on heterogeneous clusters.

The contents of this thesis are summarized as follows:

(1) This thesis formulates a MapReduce task scheduling problem and proves it to be NP-complete.

(2) To address the data locality problem of map tasks, this thesis proposes a map task execution time prediction model that predicts the end time of a task. The model identifies the node with the earliest completion time, and the scheduling algorithm uses it to find the best-matched task for that node. With the predicted task execution time, the task data can be transferred to the execution node in advance, which avoids runtime data transmission and hence reduces the overall job completion time.

(3) To solve the MapReduce task scheduling problem, we propose TMSA, a heuristic task matching-based scheduling algorithm that schedules the tasks in the Hadoop task queue by taking into account the real-time performance of each node in the cluster and the matching degree between nodes and tasks.

(4) The map task execution time prediction model and TMSA are tested and evaluated on small-scale real-life Hadoop clusters and on a CloudSim-based simulation platform, respectively. For the different test procedures, the experimental results show that, compared with FIFO and DPMQS, TMSA reduces the job completion time by 28.3% and 29.2%, and by 9.5% and 13.7%, respectively. The accuracy of the map task execution time prediction model is over 90% with 5000 samples.
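The abstract does not give the details of TMSA or of the prediction model, so the following is only a minimal sketch of the general idea described in items (2) and (3): repeatedly take the node predicted to become free earliest and give it the pending task that suits it best. Everything specific here is a hypothetical stand-in rather than the thesis's method: the `Node` and `Task` fields, the fixed `TRANSFER_PENALTY` for non-local data, and the use of the lowest predicted run time as the "matching degree".

```python
import heapq
from dataclasses import dataclass

# Assumed fixed cost (in time units) of fetching a non-local data block.
TRANSFER_PENALTY = 2.0


@dataclass(frozen=True)
class Node:
    name: str
    speed: float              # hypothetical relative processing speed
    local_blocks: frozenset   # IDs of data blocks already stored on this node


@dataclass(frozen=True)
class Task:
    block_id: int
    work: float               # hypothetical amount of work for the map task


def predicted_time(node: Node, task: Task) -> float:
    """Toy stand-in for a prediction model: compute time plus transfer cost."""
    run = task.work / node.speed
    if task.block_id in node.local_blocks:
        return run
    return run + TRANSFER_PENALTY


def greedy_schedule(nodes, tasks):
    """Assign tasks one at a time to the node that becomes free earliest.

    Returns (assignment, makespan), where assignment is a list of
    (block_id, node_name, predicted_end_time) tuples.
    """
    pending = list(tasks)
    # Min-heap of (time the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(nodes))]
    heapq.heapify(free_at)

    assignment = []
    makespan = 0.0
    while pending:
        start, i = heapq.heappop(free_at)
        node = nodes[i]
        # "Best match" is assumed here to mean the lowest predicted time.
        task = min(pending, key=lambda t: predicted_time(node, t))
        pending.remove(task)
        end = start + predicted_time(node, task)
        assignment.append((task.block_id, node.name, end))
        makespan = max(makespan, end)
        heapq.heappush(free_at, (end, i))
    return assignment, makespan
```

Calling `greedy_schedule(nodes, tasks)` with a list of `Node` and `Task` objects returns the per-task placement and the predicted completion time of the batch. A real scheduler would replace `predicted_time` with a model trained on measured task run times and would refresh node performance at run time.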
Keywords/Search Tags:Distributed Computing Framework, Hadoop, MapReduce, Yarn, Scheduler