
A Research Of Straggler Strategy For Heterogeneous Spark Cluster

Posted on: 2019-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: P F Zhang
Full Text: PDF
GTID: 2428330590965761
Subject: Computer Science and Technology
Abstract/Summary:
Apache Spark is a scalable in-memory cluster-computing framework. Spark decomposes a job into numerous tasks and assigns them to available nodes for higher efficiency. However, some tasks become stragglers because of node failure, network traffic, I/O contention and similar problems. Stragglers are tasks that take an unusually long time to complete; since an application finishes only when its last task finishes, a single straggler delays the whole application. With the continuous expansion of cluster hardware in the data center, heterogeneity within a cluster becomes inevitable, and stragglers appear more often in a heterogeneous environment. Speculative execution is designed to deal with stragglers by launching backup copies of slow-running tasks on alternative machines. This is a typical way of trading space for time. To mitigate stragglers in a heterogeneous Spark cluster, this thesis studies the speculative execution and task scheduling mechanisms of Spark and improves on their deficiencies. The main contributions are as follows:

1. In a heterogeneous environment, the default speculative execution strategy of Spark identifies stragglers with low accuracy and ignores the causes of stragglers, which may further extend the execution time. Therefore, we develop an improved speculative execution strategy, DBMPTE (Data-Based Multiple Phases Time Estimation), which selects stragglers by estimating their remaining time based on data volume and filters out those stragglers that cannot be accelerated, according to their data volume and task progress. Experiments comparing it with the Spark-None and Spark-Native strategies show that DBMPTE can effectively shorten the execution time of an application and save computing resources at the same time.

2. Because of its coarse analysis of data locality and the variety of machine capabilities, the native task scheduler of Spark may assign backup tasks to other slow nodes, which causes speculative execution to fail. To address this problem, we further propose HSBTS (Backup Task Scheduling Strategy for Heterogeneous Spark Cluster) on the basis of DBMPTE. HSBTS chooses the node for a backup task according to data locality and node performance. Unlike the native scheduling strategy, whether used alone or together with DBMPTE, HSBTS tries to assign each backup task to the node with the highest execution efficiency.

The research shows that the straggler problem in a heterogeneous Spark cluster can be mitigated by accurate identification of stragglers and effective scheduling of the backup tasks. This thesis proposes two straggler mitigation strategies, DBMPTE and HSBTS, which can effectively shorten the execution time of an application and improve the performance of the Spark cluster.
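The abstract does not give the formulas behind DBMPTE or HSBTS, so the following is only a minimal Python sketch of the general idea it describes: estimate a task's remaining time from its data volume, skip tasks that a backup could not accelerate, and place the backup on the node expected to finish it soonest. Every name here (`TaskProgress`, `estimate_remaining_s`, `estimate_backup_s`, `pick_backup`, the per-node `node_rates` in bytes per second) is a hypothetical illustration, not part of Spark or of the thesis. Data locality and the multiple task phases are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class TaskProgress:
    """Hypothetical snapshot of one running task."""
    task_id: str
    node: str             # node the task currently runs on
    bytes_total: int      # input size of the task
    bytes_processed: int  # bytes consumed so far
    elapsed_s: float      # running time so far, in seconds


def estimate_remaining_s(t: TaskProgress) -> Optional[float]:
    """Remaining time = unprocessed bytes / observed processing rate."""
    if t.bytes_processed <= 0 or t.elapsed_s <= 0:
        return None  # no processing rate observed yet
    rate = t.bytes_processed / t.elapsed_s
    return (t.bytes_total - t.bytes_processed) / rate


def estimate_backup_s(t: TaskProgress, node_rate: float) -> float:
    """Time for a fresh copy that restarts from zero on another node."""
    return t.bytes_total / node_rate


def pick_backup(
    tasks: Iterable[TaskProgress], node_rates: Dict[str, float]
) -> Optional[Tuple[str, str, float]]:
    """Return (task_id, node, seconds_saved) for the most useful backup.

    A task is only a candidate if some other node is expected to finish
    a fresh copy sooner than the original finishes; otherwise a backup
    would waste resources. Returns None when no backup helps.
    """
    best = None
    for t in tasks:
        remaining = estimate_remaining_s(t)
        if remaining is None:
            continue
        for node, rate in node_rates.items():
            if node == t.node:
                continue  # a backup on the same node gains nothing
            saved = remaining - estimate_backup_s(t, rate)
            if saved > 0 and (best is None or saved > best[2]):
                best = (t.task_id, node, saved)
    return best


# Assumed example data: one slow task on a slow node, one nearly done task.
tasks = [
    TaskProgress("t1", "slow", bytes_total=1000, bytes_processed=100, elapsed_s=10.0),
    TaskProgress("t2", "fast", bytes_total=1000, bytes_processed=900, elapsed_s=9.0),
]
node_rates = {"slow": 10.0, "fast": 100.0, "mid": 50.0}
print(pick_backup(tasks, node_rates))  # ('t1', 'fast', 80.0)
```

In this example `t2` is never backed up because it is about one second from finishing, while `t1` is re-launched on the fastest other node. A scheduler closer to what the abstract describes would also weigh data locality when ranking candidate nodes.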
Keywords/Search Tags: heterogeneous Spark cluster, straggler, speculative execution, backup task scheduling