
Cost-based Configuration Optimization Analysis For Apache Spark

Posted on: 2020-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: T Wang
GTID: 2428330572473701
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of the information society, parallel computing is widely used in industry because of its advantages in computing speed and computing power, and a series of distributed computing platforms led by MapReduce and Apache Spark have been favored by the industry. In actual use, however, the many and complex configuration parameters of these platforms easily cause job performance problems. Studies have shown that unreasonable job configurations can lead to a sharp drop in job performance, reduced cluster resource utilization, and even job failure. In real application scenarios, because of the complexity of the cluster environment, the opacity of Spark's internal mechanisms, and the inherent complexity of distributed execution, performance tuning depends on the long-term experience of operations and maintenance personnel and does not generalize. Spark has a huge parameter space in which many parameters directly affect job performance. Existing tuning methods mostly rely on manual tuning by users; this is slow and inefficient, and its complexity increases dramatically as the cluster size continues to expand.

This paper designs a cost-based Spark job configuration optimization algorithm and, on that basis, realizes automatic tuning of Spark job configuration parameters. The proposed algorithm is divided into two parts: Spark job performance modeling and Spark job configuration optimization. Performance modeling builds a model with a gradient boosting algorithm for each class of Spark jobs with similar resource utilization characteristics. Configuration optimization then searches the configuration parameter space, using the performance model built in this paper to estimate the cost of candidate configurations.

Based on this algorithm, the paper implements a Spark job configuration optimization system that performs automatic configuration optimization. The system also includes a job performance monitoring subsystem that continuously collects cluster job information for offline modeling and continuously improves the model's accuracy.
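To make the two-stage approach concrete, the following is a minimal sketch in Python, assuming scikit-learn's GradientBoostingRegressor as the performance model and a plain random search as the configuration optimizer. The parameter names, value ranges, and stand-in training data are illustrative assumptions, not the thesis's actual feature set or search strategy.

```python
# Sketch of the two-stage approach: (1) fit a gradient-boosted cost model
# on historical (configuration, runtime) pairs, (2) search the parameter
# space for the configuration with the lowest predicted cost.
import random
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical Spark parameters to tune, with example value ranges.
PARAM_SPACE = {
    "spark.executor.memory.gb": (1, 16),
    "spark.executor.cores": (1, 8),
    "spark.executor.instances": (2, 32),
    "spark.sql.shuffle.partitions": (50, 1000),
}

def sample_config():
    """Draw one random configuration from the parameter space."""
    return {k: random.randint(lo, hi) for k, (lo, hi) in PARAM_SPACE.items()}

def to_vector(config):
    """Encode a configuration as a feature vector in a fixed key order."""
    return [config[k] for k in sorted(PARAM_SPACE)]

# --- Stage 1: performance modeling ------------------------------------
# One model would be trained per class of jobs with similar resource
# utilization profiles; a single class is shown here. Real training data
# would come from the monitoring subsystem's collected job history; the
# runtimes below are a made-up stand-in function of the configuration.
history_configs = [sample_config() for _ in range(200)]
history_runtimes = [
    1000.0 / (c["spark.executor.instances"] * c["spark.executor.cores"])
    + 0.01 * c["spark.sql.shuffle.partitions"]
    for c in history_configs
]
model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(np.array([to_vector(c) for c in history_configs]),
          np.array(history_runtimes))

# --- Stage 2: configuration optimization ------------------------------
# Evaluate candidate configurations against the model and keep the one
# with the lowest predicted cost (runtime).
candidates = [sample_config() for _ in range(5000)]
predicted = model.predict(np.array([to_vector(c) for c in candidates]))
best = candidates[int(np.argmin(predicted))]
print("recommended configuration:", best)
```

Random search is used here only to keep the sketch short; the point of the cost-based design is that each candidate is evaluated by a cheap model prediction rather than an actual cluster run, so even a large candidate set can be screened quickly.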
Keywords: Spark, performance prediction, configuration optimization, machine learning