
Research and Implementation of a Spark Application Performance Prediction Model Based on Machine Learning

Posted on: 2019-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q B Ge
GTID: 2428330566470845
Subject: Computer system architecture

Abstract/Summary:
Driven by the demands of big data processing and distributed computing, many distributed computing frameworks have emerged. Reasonable allocation of cluster resources in a distributed computing framework has an important impact on computing efficiency, and performance prediction is the basis and key to optimizing cluster resource allocation; the demand for performance prediction of big data jobs is therefore growing. As a computing platform widely used in the big data field, Spark benefits greatly from reasonable cluster resource allocation when optimizing job performance. On this basis, and combining the needs of laboratory projects, this thesis proposes a Spark performance prediction model.

This thesis analyzes the architecture and job execution mechanism of the Spark computing framework and identifies the main factors that affect Spark job performance: the design logic of the job itself, cluster resource allocation, and Spark shuffle behavior. The latter two can be mapped to Spark cluster configuration parameters, so the model established here focuses on the relationship between job input data volume, job type, configuration parameters, and performance. Through in-depth analysis of a large amount of experimental data, the concept of the key stage of a Spark job is proposed; this concept is the foundation of the models that follow.

Using a control-variable approach, this thesis establishes two performance prediction models. The first is a key-stage-based model that studies only the relationship between a job's input data volume and its running time: it collects runtime information from small-batch runs and uses it to predict the running time of the same job on large data sets. The second model applies Naive Bayes, support vector machine, and decision tree methods to build different predictors. Its basic idea is to select similar jobs for prediction, which requires computing the similarity between Spark jobs; the thesis measures this similarity with the edit distance between job DAG graphs, and computing the edit distance over key stages rather than full DAGs greatly reduces the computational complexity. Finally, experiments verify that the models established in this thesis achieve good accuracy.
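As a minimal sketch of the first model's idea, the following fits a relation between input data volume and key-stage running time from small-batch runs and extrapolates to a larger input. The data, the linear growth assumption, and the function name are illustrative placeholders; the abstract does not specify the thesis's actual fitting method.

```python
import numpy as np

# Hypothetical profiling data from small-batch runs of one Spark job:
# input sizes (GB) and measured running times (s) of a key stage.
sizes = np.array([1.0, 2.0, 4.0, 8.0])
times = np.array([12.0, 21.0, 40.0, 77.0])

# Assume the key-stage running time grows roughly linearly with input
# volume and fit t = a * size + b by least squares (a simplifying
# assumption, not necessarily the model used in the thesis).
a, b = np.polyfit(sizes, times, deg=1)

def predict_runtime(size_gb: float) -> float:
    """Extrapolate the key-stage running time to a larger data set."""
    return a * size_gb + b

print(predict_runtime(64.0))  # predicted running time for a 64 GB input
```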
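For the second model, which compares Naive Bayes, support vector machine, and decision tree predictors, one possible sketch with scikit-learn is shown below. The feature set, the discretized performance classes, and all data values are assumptions made for illustration; the abstract does not state what features or labels the thesis used.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per Spark job run, e.g.
# [input size (GB), executor cores, executor memory (GB), shuffle partitions].
X = np.array([
    [1, 2, 4, 100], [2, 2, 4, 100], [4, 4, 8, 200],
    [8, 4, 8, 200], [8, 8, 16, 400], [16, 8, 16, 400],
])
# Hypothetical labels: a discretized performance class per run
# (0 = fast, 1 = medium, 2 = slow).
y = np.array([0, 0, 1, 1, 2, 2])

# Compare the three model families named in the thesis.
for name, model in [("naive_bayes", GaussianNB()),
                    ("svm", SVC(kernel="rbf")),
                    ("decision_tree", DecisionTreeClassifier(max_depth=3))]:
    scores = cross_val_score(model, X, y, cv=2)
    print(name, scores.mean())
```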
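For job similarity, the thesis computes the edit distance between job DAG graphs and uses key stages to simplify the computation. A minimal sketch follows, assuming each job can be reduced to an ordered sequence of key-stage labels so that the DAG comparison becomes a standard Levenshtein distance; that reduction and the stage labels are assumptions, since the abstract does not detail the DAG encoding.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two key-stage sequences."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between the prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a stage
                           dp[i][j - 1] + 1,         # insert a stage
                           dp[i - 1][j - 1] + cost)  # substitute a stage
    return dp[m][n]

# Hypothetical key-stage sequences of two Spark jobs.
job_a = ["map", "shuffle", "reduce"]
job_b = ["map", "filter", "shuffle", "reduce"]
print(edit_distance(job_a, job_b))  # 1 -> the two jobs are highly similar
```

A small distance would mark two jobs as similar, so the historical job's runtime profile can be reused for prediction, which is consistent with the "select similar jobs" idea the abstract describes.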
Keywords/Search Tags:Distributed Computing Framework, Spark, Performance Prediction, Key Stages, Machine Learning