
Research On Task Scheduling Of Geo-distributed Big-data Processing Jobs

Posted on: 2020-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: X B Li
Full Text: PDF
GTID: 2428330575458320
Subject: Computer Science and Technology

Abstract/Summary:
In recent years, public cloud services have become increasingly mature and widespread. Public cloud offers a new business service model that is user-centered, makes resources transparent to users, and charges on a pay-as-you-go basis, saving users the cost and effort of building and maintaining their own server clusters. Many small and medium-sized enterprises and organizations that are sensitive to labor, material, and time costs therefore choose to purchase cloud servers on public cloud platforms to build their own clusters. At the same time, with the wide adoption of big data processing technology, a new application scenario has emerged in recent years: geo-distributed big data processing, that is, processing big data distributed across geographically separate clusters. Depending on their business types or regions, many companies and organizations need to buy cloud servers in different regions and build clusters there to handle the corresponding business. These clusters generate large amounts of data every day, and the data needs to be processed as a whole. We call jobs whose input data is distributed across regions geo-distributed big data processing jobs.

The traditional way to process large volumes of geo-distributed data is still centralized: transfer the data to a single cluster and then analyze it. This approach is not always feasible. Beyond the time and monetary cost of transmitting large amounts of raw data, the more important problem is that some countries or regions prohibit transmitting users' raw data abroad in order to protect data privacy. Avoiding movement of raw data therefore requires a decentralized approach, which in turn raises several problems. First, cross-region transmission of a job's intermediate data is unavoidable, and public cloud platforms charge for data transfer over WAN links; although intermediate data is generally smaller than the raw data, it still incurs a transfer cost, which matters to cost-sensitive users. Second, the completion time of a geo-distributed big data processing job is more vulnerable to WAN link bottlenecks, to differences among clusters, and to environmental changes within each cluster. Under these constraints, how to process geo-distributed big data more reasonably is an urgent problem.

This thesis studies task scheduling for big data jobs over geo-distributed data, under a budget constraint on cross-region transmission of intermediate data and without moving the raw data. The main contributions of this thesis are:

1. A cost-constrained task scheduling strategy, RTSG, for multi-stage Spark-style jobs. It comprises a budget allocation strategy that adaptively adjusts each stage's data transmission budget, an estimation method for stage completion time, and a task scheduling strategy under data transmission cost constraints (a sketch of the placement step is given below).

2. On the basis of RTSG, we further consider the impact of dynamic changes in available computing resources on geo-distributed data processing jobs and propose a resource-aware task scheduling strategy, R-RTSG. It comprises a Markov chain based method for forecasting a cluster's available computing resources and resource requests (also sketched below), a resource reservation strategy for geo-distributed data processing jobs, and a dynamic task scheduling strategy that adjusts task assignments in real time according to each cluster's running status.
3. Because existing big data processing systems are not directly suitable for geo-distributed data processing jobs, we design and implement a system for such jobs based on Spark and Hadoop. It runs geo-distributed data processing jobs directly, without modifying existing traditional jobs.

Simulation and system experiments show that, compared with Spark's default task scheduling mechanism and some existing work, RTSG and R-RTSG achieve better results, reducing data transmission cost and shortening the completion time of geo-distributed data processing jobs.
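To make the cost-constrained placement in contribution 1 concrete, the following is a minimal sketch, not the thesis's actual RTSG algorithm: the tasks of one stage are assigned greedily to the cluster with the lowest estimated finish time, subject to the stage's remaining cross-region transfer budget. All names (Cluster, Task, estFinish, costPerByte) are hypothetical, and a complete inter-region price matrix is assumed.

    // Hypothetical sketch of budget-constrained task placement (not the
    // thesis's RTSG implementation). Assumes costPerByte covers every
    // ordered region pair.
    case class Cluster(id: String, region: String)
    case class Task(id: Int, inputBytesByRegion: Map[String, Long])

    // Cost of pulling a task's non-local input partitions into `target`.
    def transferCost(task: Task, target: Cluster,
                     costPerByte: Map[(String, String), Double]): Double =
      task.inputBytesByRegion.collect {
        case (region, bytes) if region != target.region =>
          bytes * costPerByte((region, target.region))
      }.sum

    // Greedy placement: the biggest tasks claim budget first; each task goes
    // to the affordable cluster with the lowest estimated finish time,
    // falling back to the cheapest cluster if none fits the remaining budget.
    def placeStage(tasks: Seq[Task], clusters: Seq[Cluster], stageBudget: Double,
                   estFinish: (Task, Cluster) => Double,
                   costPerByte: Map[(String, String), Double]): Map[Int, String] = {
      var remaining = stageBudget
      tasks.sortBy(t => -t.inputBytesByRegion.values.sum).map { task =>
        val affordable =
          clusters.filter(c => transferCost(task, c, costPerByte) <= remaining)
        val pool =
          if (affordable.nonEmpty) affordable
          else Seq(clusters.minBy(c => transferCost(task, c, costPerByte)))
        val best = pool.minBy(c => estFinish(task, c))
        remaining -= transferCost(task, best, costPerByte)
        task.id -> best.id
      }.toMap
    }

Placing the largest tasks first lets them claim budget before it is exhausted; per the abstract, RTSG additionally adapts each stage's budget, which this sketch omits.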
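Likewise, the following is a minimal sketch of the Markov chain forecast behind R-RTSG, under the assumption that a cluster's free-resource level is discretized into a small number of states; the discretization, the first-order model, and all names are illustrative, not the thesis's exact method.

    // Hypothetical sketch of Markov chain resource forecasting (illustrative,
    // not the thesis's exact model). `history` is a sequence of discretized
    // free-resource levels in 0 until nStates and must hold at least two
    // observations.
    def transitionMatrix(history: Seq[Int], nStates: Int): Array[Array[Double]] = {
      val counts = Array.fill(nStates, nStates)(0.0)
      history.sliding(2).foreach { case Seq(a, b) => counts(a)(b) += 1.0 }
      counts.map { row =>
        val total = row.sum
        if (total == 0.0) Array.fill(nStates)(1.0 / nStates) // unseen state: uniform
        else row.map(_ / total)
      }
    }

    // Roll a one-hot distribution forward `steps` intervals and report the
    // most likely resource level, e.g. as input to a reservation decision.
    def forecast(current: Int, p: Array[Array[Double]], steps: Int): Int = {
      var dist = p.indices.map(i => if (i == current) 1.0 else 0.0).toArray
      (1 to steps).foreach { _ =>
        dist = p.indices.map(j => p.indices.map(i => dist(i) * p(i)(j)).sum).toArray
      }
      dist.indexOf(dist.max)
    }

A scheduler could reserve capacity on a cluster whose forecast level stays high and shift tasks away from one trending low, which is the spirit of the resource reservation and dynamic adjustment described in contribution 2.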
Keywords/Search Tags:cloud computing, big data, geo-distributed data processing, task scheduling