
Cost-based Configuration Optimization Analysis For Apache Spark

Posted on: 2020-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: T Wang
GTID: 2428330572473701
Subject: Computer Science and Technology

Abstract/Summary:
With the rapid development of the information society, parallel computing is widely used in industry because of its advantages in computing speed and computing power, and a series of distributed computing platforms led by MapReduce and Apache Spark have been favored by the industry. In actual use, however, the many and complex configuration parameters of these platforms easily cause job performance problems. Studies have shown that unreasonable job configurations can lead to a sharp drop in job performance, reduced cluster resource utilization, and even job failure. In real application scenarios, because of the complexity of the cluster environment, the opacity of Spark's internal mechanisms, and the inherent complexity of distributed execution, performance tuning depends on the long-term experience of operations and maintenance personnel and does not generalize. Spark has a huge parameter space in which many parameters directly affect job performance. Existing tuning methods mostly rely on manual tuning by users; this is slow and inefficient, and its complexity increases dramatically as the cluster size continues to expand.

This paper designs a cost-based Spark job configuration optimization algorithm and, on that basis, realizes automatic tuning of Spark job configuration parameters. The proposed algorithm is divided into two parts: Spark job performance modeling and Spark job configuration optimization. Performance modeling builds a model with a gradient boosting algorithm for each class of Spark jobs with similar resource utilization characteristics. Configuration optimization then searches the configuration parameter space, using the performance model built in this paper to estimate the cost of candidate configurations.

Based on this algorithm, the paper implements a Spark job configuration optimization system that performs automatic configuration optimization. The system also includes a job performance monitoring subsystem that continuously collects cluster job information for offline modeling and continuously improves the model's accuracy.
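To make the two-stage approach concrete, the following is a minimal sketch in Python, assuming scikit-learn's GradientBoostingRegressor as the performance model and a plain random search as the configuration optimizer. The parameter names, value ranges, and stand-in training data are illustrative assumptions, not the thesis's actual feature set or search strategy.

```python
# Sketch of the two-stage approach: (1) fit a gradient-boosted cost model
# on historical (configuration, runtime) pairs, (2) search the parameter
# space for the configuration with the lowest predicted cost.
import random
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical Spark parameters to tune, with example value ranges.
PARAM_SPACE = {
    "spark.executor.memory.gb": (1, 16),
    "spark.executor.cores": (1, 8),
    "spark.executor.instances": (2, 32),
    "spark.sql.shuffle.partitions": (50, 1000),
}

def sample_config():
    """Draw one random configuration from the parameter space."""
    return {k: random.randint(lo, hi) for k, (lo, hi) in PARAM_SPACE.items()}

def to_vector(config):
    """Encode a configuration as a feature vector in a fixed key order."""
    return [config[k] for k in sorted(PARAM_SPACE)]

# --- Stage 1: performance modeling ------------------------------------
# One model would be trained per class of jobs with similar resource
# utilization profiles; a single class is shown here. Real training data
# would come from the monitoring subsystem's collected job history; the
# runtimes below are a made-up stand-in function of the configuration.
history_configs = [sample_config() for _ in range(200)]
history_runtimes = [
    1000.0 / (c["spark.executor.instances"] * c["spark.executor.cores"])
    + 0.01 * c["spark.sql.shuffle.partitions"]
    for c in history_configs
]
model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(np.array([to_vector(c) for c in history_configs]),
          np.array(history_runtimes))

# --- Stage 2: configuration optimization ------------------------------
# Evaluate candidate configurations against the model and keep the one
# with the lowest predicted cost (runtime).
candidates = [sample_config() for _ in range(5000)]
predicted = model.predict(np.array([to_vector(c) for c in candidates]))
best = candidates[int(np.argmin(predicted))]
print("recommended configuration:", best)
```

Random search is used here only to keep the sketch short; the point of the cost-based design is that each candidate is evaluated by a cheap model prediction rather than an actual cluster run, so even a large candidate set can be screened quickly.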
Keywords: Spark, performance prediction, configuration optimization, machine learning