
Optimization Of Spark Task Scheduler For Shuffle Operators

Posted on: 2021-06-28 | Degree: Master | Type: Thesis
Country: China | Candidate: H Q Wu | Full Text: PDF
GTID: 2518306107953249 | Subject: Computer technology
Abstract/Summary:
Distributed computing provides a new platform for processing and efficiently analyzing massive data. With the development of in-memory computing, Spark is now widely applied in big data processing. The task scheduler in Spark determines the distribution of data and the execution route, which can lower the execution efficiency of the whole cluster. Optimizing the Spark task scheduler can therefore improve the performance of the Spark cluster. To address the problem that the Spark task scheduler does not fully consider the characteristics of operators and of the data distribution, a task scheduler optimization method for shuffle operators is implemented. In this method, the global key distribution of the Map stage is obtained by sampling the data before the shuffle, and operator information is collected through a shuffle operator-point recognition algorithm. Operator characteristics, data distribution characteristics, and cluster parameters are then combined to construct a cost estimation model, from which a decision algorithm for a reasonable number of partitions is derived. When the data is heavily skewed, a distribution-aware data balancing partition method is applied. Finally, using the predicted distribution of intermediate data on each node, a heuristic task assignment strategy based on data locality is implemented. The optimization effect of this scheduling method is verified for different operator types in both data-skewed and non-skewed scenarios. Experiments show that, compared with the default mode, the scheduling optimization method yields clear improvements for iterative applications in both data-skewed and non-skewed scenarios.
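The abstract does not give the details of the sampling step or of the balancing partition method, so the following is only a minimal Python sketch of the general idea under stated assumptions: key frequencies are estimated from a Bernoulli sample of the pre-shuffle records, and keys are then assigned to partitions with a greedy "heaviest key to lightest partition" heuristic. The function names (`sample_key_counts`, `balanced_assignment`, `partition_loads`) are hypothetical and are not taken from the thesis or from the Spark API.

```python
import heapq
import random
from collections import Counter


def sample_key_counts(records, fraction, seed=0):
    """Estimate per-key record counts from a Bernoulli sample of (key, value) pairs."""
    rng = random.Random(seed)
    sampled = Counter(key for key, _ in records if rng.random() < fraction)
    # Scale the sampled counts back up to an estimate of the full distribution.
    return {key: count / fraction for key, count in sampled.items()}


def balanced_assignment(key_counts, num_partitions):
    """Greedy assignment (assumed heuristic): heaviest key goes to the lightest partition."""
    heap = [(0.0, p) for p in range(num_partitions)]  # (estimated load, partition id)
    heapq.heapify(heap)
    assignment = {}
    for key, count in sorted(key_counts.items(), key=lambda kv: -kv[1]):
        load, partition = heapq.heappop(heap)
        assignment[key] = partition
        heapq.heappush(heap, (load + count, partition))
    return assignment


def partition_loads(key_counts, assignment, num_partitions):
    """Sum the estimated record counts that land on each partition."""
    loads = [0.0] * num_partitions
    for key, count in key_counts.items():
        loads[assignment[key]] += count
    return loads


if __name__ == "__main__":
    # One hot key plus many light keys, i.e. a skewed key distribution.
    records = [("hot", i) for i in range(60)] + [(f"k{i}", 0) for i in range(60)]
    counts = sample_key_counts(records, fraction=1.0)
    assignment = balanced_assignment(counts, num_partitions=4)
    print(partition_loads(counts, assignment, num_partitions=4))
```

In this sketch a single hot key still bounds the heaviest partition from below, since a key is never split across partitions; whether the thesis splits hot keys is not stated in the abstract.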
Keywords/Search Tags: Distributed cluster, shuffle operator, data skew, task scheduling