
Research Of Task Partition And Resource Allocation Algorithms For Load Balance In Spark Computing Environment

Posted on: 2018-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Li
Full Text: PDF
GTID: 2348330542461657
Subject: Computer technology

Abstract/Summary:
With the rapid development of the Internet and the popularity of 3G phones, people constantly produce data at every moment, which has directly driven the rapid growth of data volume and marks the arrival of the era of big data. How to mine large-scale enterprise data for information that produces business value has become a new direction for distributed cloud computing. However, as the scale of data and the range of application scenarios continue to expand, users place ever higher requirements on cloud computing platforms. Spark, proposed by AMPLab, is a distributed parallel programming model for processing large amounts of data. Compared with MapReduce, Spark provides a cache mechanism to support iterative computation and repeated data sharing. Traditional task partition and resource allocation, however, do not take data skew into account; as the system scale grows, the data volumes of tasks become significantly unbalanced, which results in an unreasonable division of intermediate data and load imbalance across task nodes during the shuffle process of the Spark platform.

According to the characteristics of Spark tasks and their scheduling process, this paper proposes task partition and resource allocation algorithms for data skew (PRDS) in the Spark computing environment. This strategy takes both data skew and the current node workloads into account. To improve the load balance of task nodes, we first split the heavy clusters of <key, value> pairs before the intermediate data is delivered to the task nodes, and then propose a remaining-time evaluation model for Spark tasks. Moreover, considering the uncertainty of task execution, we use a machine learning method to identify similar tasks; the prediction model combines the estimated remaining execution time of a task with the current system load, and its output guides task division and effective task scheduling. Finally, we implement PRDS in Spark and evaluate its performance through widely used benchmarks. Compared with the original implementations in the Spark system, the experimental results show that PRDS can achieve a performance improvement.

The proposed PRDS method not only considers data skew in the task scheduling process, but also takes full account of node priority and task priority in the DAG logical scheduling of Spark tasks. During scheduling, the algorithm identifies similar tasks, splits data-skewed tasks into near-optimal partitions, and performs locally optimal scheduling based on a greedy algorithm, comprehensively considering priority, the current remaining data volume, node computing capacities, and cluster resource utilization. By varying experimental parameters such as network throughput and task granularity, the method improves the cluster benefit obtained during scheduling.
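The splitting of heavy <key, value> clusters described above can be illustrated with a short sketch. The following Scala/Spark example is a minimal, hypothetical illustration of the idea, assuming a simple count-based skew test and a "key salting" split; the names and thresholds (skewThreshold, saltFactor) are assumptions for illustration, not the thesis's actual PRDS implementation.

import org.apache.spark.sql.SparkSession

object SkewSplitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SkewSplitSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy <key, value> pairs with one heavy key.
    val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)))

    // 1. Sample the key distribution to find heavy clusters.
    val counts = pairs.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _).collectAsMap()
    val skewThreshold = 2L // assumed cutoff for a "heavy" key
    val saltFactor = 4     // assumed number of sub-partitions per heavy key

    // 2. Salt heavy keys so their records spread over several reduce tasks.
    val salted = pairs.map { case (k, v) =>
      val salt =
        if (counts.getOrElse(k, 0L) > skewThreshold) scala.util.Random.nextInt(saltFactor)
        else 0
      ((k, salt), v)
    }

    // 3. Aggregate per salted key (balanced shuffle), then merge the
    //    partial results per original key in a much smaller second shuffle.
    val merged = salted
      .reduceByKey(_ + _)
      .map { case ((k, _), v) => (k, v) }
      .reduceByKey(_ + _)

    merged.collect().foreach(println)
    spark.stop()
  }
}

The greedy placement step can be sketched in the same spirit: each task is assigned to the node with the lowest capacity-normalized load, placing the largest tasks first. Again, the scoring formula and all names are illustrative assumptions, not the thesis's actual model.

case class Node(id: String, capacity: Double, var load: Double = 0.0)

// LPT-style greedy: sort tasks by descending estimated remaining data volume,
// then assign each to the node whose (load + cost) / capacity is smallest.
def greedyAssign(taskCosts: Seq[(String, Double)], nodes: Seq[Node]): Map[String, String] =
  taskCosts.sortBy { case (_, cost) => -cost }.map { case (task, cost) =>
    val best = nodes.minBy(n => (n.load + cost) / n.capacity)
    best.load += cost
    task -> best.id
  }.toMap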
Keywords/Search Tags: Spark, Data Skew, Task Partition, Greedy Algorithm, Task Scheduling