
MDC-Hadoop: MapReduce Task Scheduling on Heterogeneous Geo-distributed Data Centers

Posted on: 2019-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: F C Chen
GTID: 2428330596960887
Subject: Computer application technology
Abstract/Summary:
Task scheduling strongly influences the performance of big data analysis and has become a research hotspot in MapReduce in recent years. Large-scale data-intensive computing has also grown steadily in importance. In High Energy Physics (HEP), the data generated each year by the Large Hadron Collider (LHC) is stored in 140 computing centers in more than 40 countries. Because the data is distributed in this way, it is not practical to aggregate it into one data center for analysis, and data transmission between data centers has become an important factor affecting MapReduce task scheduling. Considering deadlines, data locality, intermediate data processing, and load balancing, this paper studies the MapReduce task scheduling problem in geo-distributed heterogeneous data centers, which has important theoretical significance and application prospects.

First, the G-Hadoop framework is improved, and a three-stage mathematical model is established that reflects the characteristics of the Map, Shuffle, and Reduce phases. Then the optimization objective and constraints are given. Finally, a MapReduce task scheduling algorithm for geo-distributed heterogeneous data centers is proposed. The algorithm has three main parts: task scheduling in the Map phase, Reduce data center selection, and task scheduling in the Reduce phase. In each heartbeat, task queues for the Map phase and the Reduce phase are constructed for the available slots according to job and task sequencing rules. Task scheduling in the Map phase mainly considers data locality and assigns tasks so as to minimize the total data locality cost. Reduce data center selection chooses a suitable data center for each job that completes the Map phase, considering the intermediate data processing time and the estimated duration of the Reduce phase while taking data center load balancing into account. Task scheduling in the Reduce phase allocates tasks so as to minimize the total task execution time.

To verify the efficiency and effectiveness of the proposed algorithm, variance analysis is used to analyze the parameters and components of the algorithm and to select the most suitable parameter values and components. The proposed algorithm and the comparison algorithm are then evaluated on instances with different data center node scales and job scales. Experimental results show that the proposed algorithm outperforms the comparison algorithm across the tested data center node scales and job scales.
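The locality-driven Map phase step described above can be illustrated with a small sketch. This is not the thesis's algorithm: the abstract does not give the cost model or the sequencing rules, so the three-level cost (0 for node-local, 1 for same data center, 2 for remote data center), the names `Slot`, `MapTask`, `locality_cost`, and `schedule_map_tasks`, and the greedy per-slot choice are all assumptions made for illustration. A greedy choice also only approximates a minimum total cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """A free Map slot on a node (hypothetical structure)."""
    node: str
    data_center: str


@dataclass(frozen=True)
class MapTask:
    """A queued Map task and where its input replicas live (hypothetical)."""
    task_id: str
    replicas: tuple  # tuple of (node, data_center) pairs


def locality_cost(task, slot):
    """Assumed cost levels: 0 node-local, 1 same data center, 2 remote."""
    if any(node == slot.node for node, _ in task.replicas):
        return 0
    if any(dc == slot.data_center for _, dc in task.replicas):
        return 1
    return 2


def schedule_map_tasks(queue, free_slots):
    """Greedy sketch for one heartbeat: each free slot takes the queued
    task with the lowest locality cost; ties keep queue order."""
    pending = list(queue)
    assignment = {}
    for slot in free_slots:
        if not pending:
            break
        best = min(pending, key=lambda t: locality_cost(t, slot))
        assignment[best.task_id] = slot
        pending.remove(best)
    return assignment
```

Under these assumptions, a task whose input block sits on the node offering the slot is chosen ahead of tasks that would need a transfer within or between data centers, which is the effect the abstract attributes to locality-aware Map scheduling.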
Keywords/Search Tags:MapReduce, task scheduling, geo-distributed, data centers