
Research and Implementation of a Spark Application Performance Prediction Model Based on Machine Learning

Posted on: 2019-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q B Ge
GTID: 2428330566470845
Subject: Computer system architecture

Abstract/Summary:
Driven by the demands of big data processing and distributed computing, many distributed computing frameworks have emerged. Reasonable allocation of cluster resources in a distributed computing framework has an important impact on computing efficiency, and performance prediction is the basis and key to optimizing cluster resource allocation; the demand for performance prediction of big data jobs is therefore growing. As a computing platform widely used in the big data field, Spark benefits greatly from reasonable cluster resource allocation when optimizing job performance. On this basis, and combining the needs of laboratory projects, this thesis proposes a Spark performance prediction model.

This thesis analyzes the architecture and job execution mechanism of the Spark computing framework and identifies the main factors that affect Spark job performance: the design logic of the job itself, cluster resource allocation, and Spark shuffle behavior. The latter two can be mapped to Spark cluster configuration parameters, so the model established here focuses on the relationship between job input data volume, job type, configuration parameters, and performance. Through in-depth analysis of a large amount of experimental data, the concept of the key stage of a Spark job is proposed; this concept is the foundation of the models that follow.

Using a control-variable approach, this thesis establishes two performance prediction models. The first is a key-stage-based model that studies only the relationship between a job's input data volume and its running time: it collects runtime information from small-batch runs and uses it to predict the running time of the same job on large data sets. The second model applies Naive Bayes, support vector machine, and decision tree methods to build different predictors. Its basic idea is to select similar jobs for prediction, which requires computing the similarity between Spark jobs; the thesis measures this similarity with the edit distance between job DAG graphs, and computing the edit distance over key stages rather than full DAGs greatly reduces the computational complexity. Finally, experiments verify that the models established in this thesis achieve good accuracy.
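As a minimal sketch of the first model's idea, the following fits a relation between input data volume and key-stage running time from small-batch runs and extrapolates to a larger input. The data, the linear growth assumption, and the function name are illustrative placeholders; the abstract does not specify the thesis's actual fitting method.

```python
import numpy as np

# Hypothetical profiling data from small-batch runs of one Spark job:
# input sizes (GB) and measured running times (s) of a key stage.
sizes = np.array([1.0, 2.0, 4.0, 8.0])
times = np.array([12.0, 21.0, 40.0, 77.0])

# Assume the key-stage running time grows roughly linearly with input
# volume and fit t = a * size + b by least squares (a simplifying
# assumption, not necessarily the model used in the thesis).
a, b = np.polyfit(sizes, times, deg=1)

def predict_runtime(size_gb: float) -> float:
    """Extrapolate the key-stage running time to a larger data set."""
    return a * size_gb + b

print(predict_runtime(64.0))  # predicted running time for a 64 GB input
```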
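For the second model, which compares Naive Bayes, support vector machine, and decision tree predictors, one possible sketch with scikit-learn is shown below. The feature set, the discretized performance classes, and all data values are assumptions made for illustration; the abstract does not state what features or labels the thesis used.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per Spark job run, e.g.
# [input size (GB), executor cores, executor memory (GB), shuffle partitions].
X = np.array([
    [1, 2, 4, 100], [2, 2, 4, 100], [4, 4, 8, 200],
    [8, 4, 8, 200], [8, 8, 16, 400], [16, 8, 16, 400],
])
# Hypothetical labels: a discretized performance class per run
# (0 = fast, 1 = medium, 2 = slow).
y = np.array([0, 0, 1, 1, 2, 2])

# Compare the three model families named in the thesis.
for name, model in [("naive_bayes", GaussianNB()),
                    ("svm", SVC(kernel="rbf")),
                    ("decision_tree", DecisionTreeClassifier(max_depth=3))]:
    scores = cross_val_score(model, X, y, cv=2)
    print(name, scores.mean())
```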
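For job similarity, the thesis computes the edit distance between job DAG graphs and uses key stages to simplify the computation. A minimal sketch follows, assuming each job can be reduced to an ordered sequence of key-stage labels so that the DAG comparison becomes a standard Levenshtein distance; that reduction and the stage labels are assumptions, since the abstract does not detail the DAG encoding.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two key-stage sequences."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between the prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a stage
                           dp[i][j - 1] + 1,         # insert a stage
                           dp[i - 1][j - 1] + cost)  # substitute a stage
    return dp[m][n]

# Hypothetical key-stage sequences of two Spark jobs.
job_a = ["map", "shuffle", "reduce"]
job_b = ["map", "filter", "shuffle", "reduce"]
print(edit_distance(job_a, job_b))  # 1 -> the two jobs are highly similar
```

A small distance would mark two jobs as similar, so the historical job's runtime profile can be reused for prediction, which is consistent with the "select similar jobs" idea the abstract describes.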
Keywords/Search Tags:Distributed Computing Framework, Spark, Performance Prediction, Key Stages, Machine Learning