Research And Implementation Of Performance Modeling And Optimization Technology Of Spark Computing Framework

Posted on: 2018-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Wen
GTID: 2348330521950907
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of the Internet in recent years, data volumes have grown explosively, and traditional processing approaches can no longer meet existing needs, especially those of enterprises. MapReduce, a distributed computing framework, has helped many enterprises meet their data processing needs. However, as data continues to grow, the computing capacity that MapReduce provides is becoming insufficient, and its latency is increasingly unable to meet enterprise requirements. As a result, the memory-based Spark framework has become popular: more and more enterprises choose Spark to process large data sets and want higher performance from it.

To improve Spark performance, many organizations and individuals have worked on different aspects and developed many optimization methods, but few studies optimize performance by automatically searching for optimal configuration parameters. To address this gap, this paper presents an optimization method that gives different optimized parameters for different Spark applications and reduces program runtime, thereby optimizing Spark performance. The main idea is to establish a performance model of the Spark framework, run the application on a small data set, and collect the parameters required by the performance model by modifying the Spark source code. With this model, the method can predict the runtime of the application on a large data set. The optimization algorithm then iteratively calls the prediction model to find the optimal configuration parameter set, thereby reducing the runtime of the application. The research in this paper is as follows.

(1) Collecting the application's running data. This paper adds monitoring code to the Spark 1.4.0 source code to collect the data flow and execution time during Task processing, as well as the execution information of each Job and the DAG information between Stages. The collected information is stored locally in an XML file and is used to build the Task prediction model and to implement the simulated scheduling model.

(2) Building the prediction model. By reading the Spark 1.4.0 source code, this paper analyzes how an application is scheduled and how a Task is executed, and establishes a mathematical model of Task execution time based on the collected data flow, the execution time, and the selected configuration parameters. Using the collected DAG information between Stages, this paper simulates Spark's scheduling process. Combining the two yields the prediction model for the whole application.

(3) Implementing cost-based optimization algorithms. This paper implements a random grid algorithm, a recursive random search algorithm, a genetic algorithm, and a particle swarm optimization algorithm, each of which generates different configuration parameters over multiple iterations. The prediction model is then called to calculate the predicted running time under those configuration parameters. Finally, the optimization algorithms return the configuration parameters that give the shortest predicted application execution time, thereby tuning the performance of the Spark application.

The experimental part of this paper uses the HiBench benchmark platform, provided by Intel, to verify the Spark performance optimization method, with WordCount, Sort, TeraSort, PageRank, K-means, and Bayes as the test applications. The experiments validate the accuracy of the prediction model and the effect of the optimization algorithms, and compare the four cost-based optimization (CBO) algorithms with rule-based optimization (RBO). The experimental results show that the optimization method in this paper performs better than rule-based optimization.
Keywords/Search Tags:Performance Model, Configuration Parameters, Performance Optimization, Task Scheduling, Spark
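
The cost-based search loop described in the abstract (generate a candidate configuration, ask the prediction model for its runtime, keep the best) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the parameter names, their ranges, the default configuration, and especially the toy `predicted_runtime` function, which only stands in for the thesis's prediction model and is not derived from it. Plain random search is used here as the simplest member of the family; it is not the thesis's recursive random search, genetic, or particle swarm implementation.

```python
import random

# Hypothetical search space (inclusive integer ranges). The names loosely
# mirror common Spark tuning knobs but are illustrative, not real setting keys.
SEARCH_SPACE = {
    "executor_cores": (1, 8),
    "executor_memory_gb": (1, 16),
    "parallelism": (8, 512),
}

# Hypothetical baseline configuration to improve on.
DEFAULT_CONFIG = {"executor_cores": 1, "executor_memory_gb": 1, "parallelism": 8}


def predicted_runtime(config, input_gb):
    """Toy stand-in for a prediction model (NOT the thesis's model).

    Returns a made-up runtime in seconds from three cost terms:
    compute per task wave, a memory-pressure penalty, and per-task overhead.
    """
    cores = config["executor_cores"]
    mem_gb = config["executor_memory_gb"]
    parts = config["parallelism"]

    waves = -(-parts // cores)          # ceil(parts / cores): task waves
    per_task_gb = input_gb / parts      # data handled by one task
    compute = waves * per_task_gb * 10.0
    # Penalty when the tasks running concurrently need more memory than available.
    spill = max(0.0, per_task_gb * cores - mem_gb) * 5.0 * waves
    overhead = 0.05 * parts             # fixed scheduling cost per task
    return compute + spill + overhead


def random_search(predict, space, input_gb, start, iterations=200, seed=0):
    """Sample configurations at random and keep the one with the lowest predicted runtime.

    The search starts from `start`, so the result is never worse than it.
    """
    rng = random.Random(seed)
    best_config = dict(start)
    best_time = predict(best_config, input_gb)
    for _ in range(iterations):
        candidate = {name: rng.randint(lo, hi) for name, (lo, hi) in space.items()}
        candidate_time = predict(candidate, input_gb)
        if candidate_time < best_time:
            best_config, best_time = candidate, candidate_time
    return best_config, best_time


if __name__ == "__main__":
    best, seconds = random_search(
        predicted_runtime, SEARCH_SPACE, input_gb=100.0, start=DEFAULT_CONFIG
    )
    print("best configuration:", best)
    print("predicted runtime (s):", round(seconds, 2))
```

Because each candidate is scored by the model rather than by running the job, the loop can evaluate hundreds of configurations cheaply; the quality of the result then depends on how accurate the prediction model is.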