Optimization Research On Shuffle Mechanism For Big Data Processing System

Posted on:2021-05-10

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Geng

Full Text:PDF

GTID:2428330647450737

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

In the application of big data analysis,data analysis framework represented by Hadoop and Spark are widely used in various industries.With the rapidly growing size of data,failure of tasks or cluster nodes is the norm at large scale deployment of big data framework.Therefore,all to all data transfer which is called shuffle operations in Map Reduce frequently fails,and then some tasks need to be recomputed.Otherwise,Jobs are often splitted into smaller tasks for better parallelism and reducing the straggler task effect.In addition,data processed by per task is less and can be kept in memory without spilling into disk.However,shuffle becomes the bottleneck when running many small tasks in multi-stage data analytics jobs.In order to achieve a balance between shuffle cost and the other,tuners often need to go through lengthy tuning experiments to get optimized shuffle paradism and obtain better performance.In order to solve the problems of stability and performance of shuffle,this paper proposes a new shuffle based on multi-copies of data files and pre-merged data files.Besides,a novel method for shuffle paralism auto-tuning based on machine leaning and intelligent search algorithm is proposed to resolve the issues in tuning shuffle parallelism.In the paper,the new shuffle and the method for shuffle paralism autotuning have been implemented on Spark.The main work and contributions of this article include:(1)This research proposes an abstract shuffle model which is general and platform-independent.The model based on multi-copies of files and pre-merged files can provide data analysis framework with good perfomance and stability.(2)Furthermore,we implement the shuffle model based on multi-copies of files with Spark.The framework uses a distributed file system to provide the multiple copies of shuffle data file for fault tolerance.To improve efficiency of reading file copies,the framework has a hierarchical index inspired by the job structure.(3)Furthermore,we implement the shuffle model based on pre-merged files with Spark.The framework merges shuffle files by taking advantage of the time slot between the first compeleted task and the last one.Files are merged by the order of partitions.the mechanism of pre-merged most files and the merge strategy considering data locality have been proposed to speed up the file merge process.(4)A shuffle parallelism auto-tuning method based on machine learning and intelligent search algorithms is proposed to get the best value and improve job performance.The method consists of two parts,one is a performance prediction model based on the regression model to describe the relationship between the job characteristics such as Shuffle parallelism and the job runtime.The other is the grid search algorithm to find the optimal shuffle parallelism that makes the model predict the shortest runtime.(5)The experiments show that the new shuffle proposed in this paper can provide 100% stability and can improve the performance of the job by at least 5% to 30% in the case of shuffle failure.When running jobs have multi-small tasks,the performance is improved by 15% to 40%;The Shuffle parallelism auto-tuning method proposed in this paper greatly shortens the job running time and improves the job performance by about 20% on average.(6)As a practical application case,shuffle implemented on Spark in this paper has been deployed in Bytedance production environment,which has been running steadily for more than one year and covers more than 57% of the workloads with 12% perfomance improvement on average.The platform provides stable and fast sql query for the data analysts and data engineers.

Keywords/Search Tags:

shuffle service, parallel computing, parameter auto-tuning, big data analystics framework, performance optimization

PDF Full Text Request

Related items

1	A framework for performance tuning and analysis on parallel computing platforms
2	Research On Significant Technologies Of Performance Optimization On In-memory Computing Framework
3	Research On Auto-tuning Techniques For Parallel CFD Programs
4	PID Controller Parameter Tuning And Optimization Technology
5	OPS:Optimized Shuffle Management For Distributed Computing Frameworks
6	Research On PID Controller Parameter Auto-Tuning Method
7	Study On The Method And Key Technology Of The Runtime Parameter Optimization Of The MPI In The Parallel Computing System
8	Near-Optimal Prior-Based Parameter Tuning Method For Distributed Storage System
9	Optimization Of Convolution Based On CPU SIMD Instruction Set
10	Research On Data Skew Optimization In Spark Computing Framework