Real-time Mass Data Processing Analysis And Optimization Based On Spark

Posted on:2019-01-03

Degree:Master

Type:Thesis

Country:China

Candidate:B Huang

GTID:2428330545467536

Subject:Software engineering

Abstract/Summary:

As real-time big data processing frameworks become more widely used, the demand for optimizing the performance of applications built on them is growing, and the requirements are becoming stricter. Spark, as the industry's most widely used and highly accepted distributed real-time big data processing framework, is widely used in real-time image processing because of its high stream processing performance. Spark runs on a distributed hardware cluster, and the efficiency of Spark operations is adjusted through various configuration parameters. Many factors must be considered when optimizing Spark's performance. The optimization space is therefore large, but the optimization itself is also very difficult, especially when specific requirements turn it into a multi-objective problem over multiple parameters. At present there are two main approaches to performance prediction and optimization for Spark: machine learning algorithms and system behavior modeling. However, both approaches generalize poorly, cannot perform inverse multi-parameter optimization, and have low accuracy in forward performance prediction.

To address these issues, and driven by the actual optimization needs of a project, this thesis investigates the operating mechanism of Spark. After exploring the important factors that affect Spark's performance, it proposes SGBDTP-GA, a Spark performance prediction and inverse multi-parameter optimization algorithm based on the GBDT algorithm and a genetic algorithm. The algorithm can accurately predict the execution time of a program running on the Spark platform, and it can also use the prediction model to optimize multiple parameters inversely for multiple objectives. The model's inputs are two types of factors that affect Spark performance, hardware configuration parameters and Spark cluster software configuration parameters, for a total of 22 feature parameters. The output is the execution time of a specific task on the Spark platform.

The experimental data come from face photos taken in real scenes. Feature values are extracted from the photos, and the algorithm is run on these features on the Spark platform to obtain the load time. The SGBDTP-GA model is trained on these load times. On this basis, a Spark performance prediction and optimization system for face feature comparison is implemented. The system predicts the performance of the face feature comparison algorithm and can also perform inverse multi-objective parameter optimization under given constraints: given a required execution time and a task size, it recommends the computing resources and an optimal parameter configuration table for the Spark platform. The experiments are conducted in a Spark cluster environment. The results show that the SGBDTP-GA model can accurately predict performance on samples it was not trained on, and that, based on the prediction model, the genetic algorithm can find the best software and hardware parameter configuration. The work can guide the configuration of hardware cluster parameters and the construction of the software platform during actual project deployment.
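
The inverse optimization idea described in the abstract can be illustrated with a minimal Python sketch of a genetic algorithm that searches for the cheapest configuration whose predicted execution time meets a deadline. This is not the thesis's SGBDTP-GA implementation. The three parameters and their ranges, the `predict_time` stand-in (which a trained GBDT model would replace), the `resource_cost` function, and all genetic algorithm settings are assumptions made purely for illustration.

```python
import random

# Hypothetical search space (name, low, high); the thesis uses 22 parameters.
SPACE = [("executors", 1, 16), ("cores_per_executor", 1, 8), ("memory_gb", 1, 32)]


def predict_time(cfg, task_size):
    """Toy stand-in for a trained predictor; NOT the thesis's GBDT model."""
    executors, cores, memory_gb = cfg
    # Assumed slowdown when total memory is small relative to the task.
    spill = 1.0 + 0.05 * max(0.0, task_size / 1000.0 - executors * memory_gb)
    # Assumed: work divides across cores, plus a per-executor overhead.
    return (task_size / (executors * cores)) * spill + 0.5 * executors


def resource_cost(cfg):
    """Assumed cost of the requested resources (lower is cheaper)."""
    executors, cores, memory_gb = cfg
    return executors * (cores + 0.25 * memory_gb)


def fitness(cfg, task_size, deadline):
    """Lower is better; any feasible config beats every infeasible one."""
    overrun = predict_time(cfg, task_size) - deadline
    return resource_cost(cfg) if overrun <= 0 else 1e6 + overrun


def crossover(a, b, rng):
    # Uniform crossover: each gene comes from one of the two parents.
    return tuple(rng.choice(pair) for pair in zip(a, b))


def mutate(cfg, rng, rate=0.2):
    # Resample each gene within its bounds with probability `rate`.
    return tuple(rng.randint(lo, hi) if rng.random() < rate else value
                 for value, (_, lo, hi) in zip(cfg, SPACE))


def tournament(pop, scores, rng, k=3):
    # Pick k random individuals and return the fittest of them.
    picks = rng.sample(range(len(pop)), k)
    return pop[min(picks, key=lambda i: scores[i])]


def optimize(task_size, deadline, pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    # Seed the population with the maximum-resource baseline configuration.
    pop = [tuple(hi for _, _, hi in SPACE)]
    pop += [tuple(rng.randint(lo, hi) for _, lo, hi in SPACE)
            for _ in range(pop_size - 1)]
    for _ in range(generations):
        scores = [fitness(c, task_size, deadline) for c in pop]
        best = pop[min(range(pop_size), key=lambda i: scores[i])]
        children = [best]  # elitism: the best individual always survives
        while len(children) < pop_size:
            parent_a = tournament(pop, scores, rng)
            parent_b = tournament(pop, scores, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        pop = children
    return min(pop, key=lambda c: fitness(c, task_size, deadline))


if __name__ == "__main__":
    best = optimize(task_size=2000, deadline=60.0)
    print(dict(zip((name for name, _, _ in SPACE), best)))
    print("predicted time:", predict_time(best, 2000))
```

Ranking feasible configurations strictly ahead of infeasible ones is one simple way to encode the execution-time constraint. A weighted penalty or a true multi-objective method would be alternative design choices, and the abstract does not say which one the thesis uses.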

Keywords/Search Tags:

Spark, Performance Prediction, GBDT, Genetic Algorithm

Related items

1	Bigdata Job Performance Prediction Based On Apache Spark
2	Performance Prediction And Optimization Of Apache Spark Based On SRFRP Model
3	Research And Implementation Of Spark Application Performance Prediction Model Based On Machine Learning
4	Research And Implementation Of Spark Performance Optimization For Police Data Processing
5	Research On Optimal Allocation Strategy Of Spark Resources Based On Performance Prediction
6	Research On Job Scheduling And Memory Cache Optimization Based On SPARK
7	Performance Prediction And Optimization For Apache Spark Platform
8	Research On Spark Optimization Based On Fine-grained Monitoring
9	Prediction Of Click-through Rate Of Internet Advertising Based On Genetic Neural Network
10	Study Of Parallel Genetic Algorithm Based On Spark Solving The Traveling Salesman Problem