Performance Optimization Methods For Shuffle Process Of Spark Platform

Posted on:2019-07-01

Degree:Master

Type:Thesis

Country:China

Candidate:S S Huang

Full Text:PDF

GTID:2428330593950601

Subject:Software engineering

Abstract/Summary:


In recent years, the arrival of the big data era has driven the continued development of big data processing technology and produced a number of outstanding processing platforms, including Hadoop, Spark, and Storm, of which Spark is the most notable. As Spark has been widely adopted at home and abroad, some of its problems have been exposed, and performance is among the most prominent. How to optimize the performance of the Spark platform, and so improve the overall efficiency of the cluster, is one of the most important questions in big data platform research. This thesis takes the Shuffle process of Spark as its research object. By analyzing the underlying execution mechanism of the Shuffle process, it focuses on the two aspects with the greatest impact on Shuffle performance: deciding which compression algorithm to use and optimizing the memory scheduling mechanism. In addition, a management and monitoring system for Spark is designed and implemented.

First, the implementation mechanism of the Shuffle process is studied. Because its compression configuration depends largely on the user's experience, a cost-based compression decision model is proposed. Experiments show that, on the basis of a small-load test, the model can predict the optimal compression configuration under the actual load. The prediction accuracy is 60%, and the predicted configuration improves on the default configuration by up to 30%, and by 10% on average.

Second, the memory mechanisms of Spark 2.x, Spark 1.6.x, and earlier Spark versions are examined, and the important role of memory scheduling optimization in the new version is pointed out. Spark's two memory scheduling algorithms, FIFO and FAIR, are also studied: practical examples compare their performance on uniformly and non-uniformly distributed data, and the advantages and disadvantages of each are analyzed. Because the FAIR algorithm considers only an even split over the total number of tasks and ignores differences in the amount of memory that individual tasks require, an improved memory scheduling algorithm is proposed. Experiments show that the improved algorithm spills less often and runs faster on non-uniformly distributed data.

Finally, a management and monitoring system for Spark is designed and implemented to address the fact that configuring and tuning the Spark platform demands considerable background knowledge from users. The system provides a common management framework that frees users from low-level operation of the system and delivers Spark services through a set of management and monitoring functions behind a visual interface. Its two most important functions are management and monitoring; the management function is a series of visual configuration options that let users manage the cluster easily.

In summary, this thesis explores performance optimization methods for the Shuffle process of Spark and obtains results on compression algorithm selection and memory scheduling optimization for that process. These results offer a useful reference for optimizing Shuffle performance on the Spark big data platform and for improving the utilization of Spark cluster resources.
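
The abstract's point about even per-task memory sharing on non-uniform data can be illustrated with a small sketch. This is not the thesis's algorithm, whose details the abstract does not give. It compares a plain even split with standard max-min fair ("water-filling") allocation, used here only as a stand-in for a demand-aware policy. The function names, the megabyte units, and the example demands are all hypothetical.

```python
def even_split(pool_mb, demands_mb):
    """Even split: each task is capped at pool / N, whatever it needs."""
    cap = pool_mb / len(demands_mb)
    return [min(d, cap) for d in demands_mb]


def demand_aware_split(pool_mb, demands_mb):
    """Max-min fair split (a stand-in, not the thesis's method).

    Tasks are visited from smallest to largest demand. Each receives
    either its full demand or an equal share of the memory still
    unallocated, so memory that small tasks do not need flows to
    larger ones.
    """
    grants = [0.0] * len(demands_mb)
    remaining = float(pool_mb)
    order = sorted(range(len(demands_mb)), key=lambda i: demands_mb[i])
    for pos, i in enumerate(order):
        share = remaining / (len(order) - pos)
        grants[i] = min(demands_mb[i], share)
        remaining -= grants[i]
    return grants


def spilled_mb(demands_mb, grants_mb):
    """Total demand that did not fit in memory and would have to spill."""
    return sum(d - g for d, g in zip(demands_mb, grants_mb))


# Hypothetical workloads sharing a 1000 MB pool.
skewed = [100, 100, 100, 700]    # one task needs far more than the rest
uniform = [400, 400, 400, 400]   # all tasks need the same amount

for name, demands in (("skewed", skewed), ("uniform", uniform)):
    even = spilled_mb(demands, even_split(1000, demands))
    aware = spilled_mb(demands, demand_aware_split(1000, demands))
    print(name, "even split spills", even, "MB; demand-aware spills", aware, "MB")
```

On the skewed workload, the even split leaves 450 MB of demand unserved while the demand-aware split serves all of it. On the uniform workload, the two policies give identical grants. This matches the abstract's claim that the gain appears on non-uniformly distributed data.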

Keywords/Search Tags:

Spark, Shuffle process, compression algorithm, memory scheduling algorithm, monitor system


Related items

1	Analysis And Optimization Of Memory Scheduling Algorithm Of Spark Shuffle
2	Research On Spark Shuffle Process Performance Optimization
3	Research On Shuffle Mechanism In Spark Cluster
4	Optimization Of Spark Task Scheduler For Shuffle Operators
5	Research On Memory Optimization Algorithm Based On Weight Priority Task Scheduling Strategy In Spark Platform
6	Research On Spark Performance Optimization Technology For In-Memory Computing
7	Research On Product Recommendation Algorithm Based On Spark Big Data Platform
8	The Research On Spark Task Scheduling Strategy Based On Dynamic Memory Awareness
9	Data Transmission And Storage Method Optimization Of Spark Shuffle
10	The Optimization Research Of Spark Memory Allocation And K-means Algorithm