Performance Prediction And Optimization For Apache Spark Platform

Posted on: 2020-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: W Z Chen
Full Text: PDF
GTID: 2428330602451044
Subject: Software engineering
Abstract/Summary:
With the rapid development of cloud computing, processing and mining the large datasets generated by a wide spectrum of applications, including social networks, bioinformatics, e-commerce, and healthcare, has become an increasingly important and challenging problem. To facilitate big data analysis, a number of evolving frameworks for parallel data processing have emerged in industry. Big data processing frameworks such as Spark have more than 100 configuration parameters that control the behavior of an application, and these parameters often have a significant influence on application performance. However, users usually lack domain knowledge of the big data framework. They cannot combine the characteristics of the application and the framework to set the configuration parameters, so they often fall back on the default configuration. How to automatically optimize Spark configuration parameters to improve application performance and reduce time cost is therefore an urgent problem.

Many automatic parameter optimization methods have been proposed to solve this problem, both at home and abroad. However, the existing methods either ignore the optimization cost in real scenarios or rely on internal analysis of a specific framework version, which makes them difficult to apply in practice. This thesis therefore proposes a tool, AutoTune, to optimize the configuration parameters. Considering the time constraints and the high-dimensional parameter search space of practical application scenarios, it proposes a Testbed environment construction method and an optimization algorithm that covers the search space widely, which together achieve efficient automatic configuration optimization. The work in this thesis includes the following aspects:

(1) The Spark performance prediction and optimization problem is analyzed, modeled, and formally defined. The workflow of the parameter optimization method is then described, and the parameter space to be optimized is determined.

(2) A Testbed construction method is designed and implemented. In practical application scenarios, optimization time constraints make it impossible to execute many searches or to collect many training samples. The Testbed environment is small in scale but accurate enough to capture the influence of the configuration parameters on the actual production environment. Running applications fed with a reduced dataset on the Testbed shortens the time of a single run and yields more training samples, which improves the accuracy of the performance prediction model.

(3) An iterative parameter optimization algorithm that combines a machine learning algorithm with a search algorithm is proposed. An exploration strategy based on Latin hypercube sampling ensures that the search samples broadly cover the high-dimensional parameter search space. An exploitation strategy based on a parameter reduction algorithm continuously narrows the search range to find a local optimal solution. During the iterations, a random forest model is continuously refined to predict performance under different configurations and to guide the exploration and exploitation process.

The validity of the proposed Testbed construction method and the advantage of the parameter optimization algorithm are verified by experiments, which consist of two parts. First, the accuracy of machine learning models trained in the Testbed and in the actual production environment within the same amount of time is compared using the nDCG metric. Second, the parameter optimization algorithm is compared with five other optimization methods under the same constraints. The experiments show that the optimal configuration generated by the parameter optimization algorithm is 63.70% better than the default configuration. Moreover, for all parameter optimization algorithms, the optimal configuration obtained in the Testbed environment is superior to that obtained in the actual production system.
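As an illustration of the exploration and exploitation steps summarized above, the following minimal Python sketch combines Latin hypercube sampling with an iterative narrowing of the search range. It is not the thesis's implementation. The random forest model is omitted, and every sampled configuration is scored directly by a hypothetical function `toy_runtime` that stands in for a measured run time. The two parameters, the shrink factor of 0.5, and all function names are assumptions made for this example.

```python
import random


def latin_hypercube(bounds, n_samples, rng):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point."""
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        # One uniformly random value inside each stratum of this dimension.
        column = [low + (i + rng.random()) * width for i in range(n_samples)]
        # Shuffle so strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def shrink_bounds(bounds, center, factor=0.5):
    """Narrow each range to `factor` of its width around `center`,
    clipped so the new range never leaves the old one."""
    new_bounds = []
    for (low, high), c in zip(bounds, center):
        half = (high - low) * factor / 2
        new_bounds.append((max(low, c - half), min(high, c + half)))
    return new_bounds


def toy_runtime(config):
    """Hypothetical stand-in for a measured run time in seconds.
    A real tuner would run the application or query a trained model."""
    executor_memory_gb, shuffle_partitions = config
    return (
        (executor_memory_gb - 6.0) ** 2
        + ((shuffle_partitions - 300.0) / 50.0) ** 2
        + 20.0
    )


def optimize(bounds, rounds=5, samples_per_round=10, seed=0):
    """Alternate exploration (Latin hypercube sampling over the current
    ranges) with exploitation (shrinking the ranges around the best
    configuration seen so far)."""
    rng = random.Random(seed)
    best_config, best_time = None, float("inf")
    for _ in range(rounds):
        for config in latin_hypercube(bounds, samples_per_round, rng):
            run_time = toy_runtime(config)
            if run_time < best_time:
                best_config, best_time = config, run_time
        bounds = shrink_bounds(bounds, best_config)
    return best_config, best_time


if __name__ == "__main__":
    # Illustrative ranges: executor memory in GB, number of shuffle partitions.
    search_space = [(1.0, 16.0), (50.0, 1000.0)]
    config, run_time = optimize(search_space)
    print("best configuration:", config)
    print("estimated run time:", run_time)
```

In this sketch each round spends its whole sample budget on the current ranges. A tuner in the spirit of the abstract would instead use the prediction model to rank candidate configurations and run only the most promising ones.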
Keywords/Search Tags: Spark, Configuration Parameter, Performance Optimization, Random Forest, Exploration and Exploitation