
Research on and Application of the Solution for Spark Data Skew Scenarios

Posted on: 2021-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Jiang
Full Text: PDF
GTID: 2518306050472834
Subject: Master of Engineering
Abstract/Summary:
Data skew refers to the situation, common on big data platforms, in which one partition holds far more data than the others, producing an uneven distribution. When the data assigned to each node is uneven, some tasks take much longer to execute than the rest, which not only seriously degrades application performance but also ties up resources excessively and may crash the system.

This thesis grows out of actual enterprise application development and focuses on data skew in Spark production environments, where it generally appears in two scenarios: real-time processing and two-table joins. In the real-time scenario, the random assignment of partition data in the Kafka message queue, together with the accumulation of pending tasks on a single Executor or at a single locality level, easily causes skew. In the two-table join scenario, several problems lead to skew: the drawbacks of hash distribution during Shuffle, the low resource utilization caused by a uniform number of Reducers in the physical operator tree, and incorrect estimates of data volume that push the optimizer toward an execution mode with Shuffle operations. Data skew in these two scenarios has therefore become the bottleneck of Spark's distributed computing and performance improvement.

Starting from the business requirements of the Spark data skew scenarios, and informed by recent domestic and foreign work on the problem, this thesis designs and implements a generally applicable solution through research on and analysis of the Spark Streaming and Spark SQL source code. The specific work is as follows:

1) To accommodate locality differences between Kafka versions, the existing modulo-based assignment is extended with ordered dynamic binding and locality-aware dynamic binding, replacing the random distribution of partition data. On top of this dynamic binding, Spark's back pressure mechanism controls the consumption rate, solving the data accumulation problem in the real-time scenario (see the configuration sketch after this list).

2) A custom locality-rate formula is introduced into the kernel code so that the Spark real-time processing system decides locality-level downgrades on its own, reducing the error caused by users hand-setting the locality-level wait time.

3) The number of Reducers is determined with a custom skew formula together with the splitting of severely skewed partitions, and a suitable execution method is selected dynamically from each stage's output size while the SQL statement runs, yielding the best execution plan.

4) On the Map side, part of the data is read at a time and joined in multiple passes; on the Reducer side, a partition reorganization algorithm based on a Markov Decision Process (MDP) finds the optimal combination of partition data, solving the performance problems caused by the drawbacks of hash distribution in the two-table join scenario (a simplified sketch follows at the end of this abstract).

5) Comprehensive functional and non-functional testing and comparison in both scenarios show that the proposed solution does improve the throughput and performance of Spark data processing under data skew.
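As a concrete illustration of the streaming-side measures in points 1) and 3), the sketch below enables Spark's built-in back pressure and computes a simple skew measure. The two configuration keys are genuine Spark properties; the skewFactor helper is our own assumption, since the abstract does not give the thesis's exact skew calculation formula.

```scala
import org.apache.spark.SparkConf

// Back pressure keeps the Kafka consumption rate in step with processing
// capacity (point 1). Both property names are real Spark configuration
// keys for the Spark Streaming + Kafka integration.
val conf = new SparkConf()
  .setAppName("skew-demo")
  .set("spark.streaming.backpressure.enabled", "true")       // dynamic rate control
  .set("spark.streaming.kafka.maxRatePerPartition", "10000") // records/sec ceiling

// Illustrative skew measure for point 3): size of the largest partition
// relative to the median. Partitions far above the median would be split
// across extra Reducers. This formula is an assumption, not the thesis's.
def skewFactor(partitionBytes: Seq[Long]): Double = {
  val sorted = partitionBytes.sorted
  val median = math.max(sorted(sorted.length / 2), 1L)
  sorted.last.toDouble / median
}
```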
The source code of the solution has now been merged into the kernel repository of the company's self-developed product in a non-intrusive way, so that the original code is unaffected and the optimizations take effect only when the relevant configuration switches are turned on. The solution has essentially resolved the data skew problems found by staff in daily maintenance and has been applied in many business scenarios that use the product. However, as data volumes grow and the covered scenarios diversify, some special data skew cases may remain unaddressed, and continuous improvement is required.
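The abstract does not reproduce the MDP-based reorganization algorithm itself, so the following is only a simplified stand-in: a greedy first-fit-decreasing repacking of map-output blocks into Reducer bins, guarded by a hypothetical configuration switch in the spirit of the gated, non-intrusive deployment described above. The flag name and all identifiers are illustrative assumptions, not the product's API.

```scala
import org.apache.spark.SparkConf

// Simplified stand-in for the MDP-based partition reorganization in point 4):
// pack map-output blocks into Reducer bins so that no bin greatly exceeds
// the average. This greedy first-fit-decreasing pass only conveys the goal
// (balanced Reducer input); the thesis models the choice as a Markov
// Decision Process, which is not reproduced here.
object PartitionRepackSketch {
  // Hypothetical switch, mirroring the "effective only when the relevant
  // configuration is turned on" deployment style; not a real Spark key.
  val RepackEnabledKey = "spark.sql.skew.repack.enabled"

  /** Returns assignment(i) = Reducer bin for map-output block i. */
  def repack(blockSizes: Array[Long], numReducers: Int): Array[Int] = {
    val assignment = new Array[Int](blockSizes.length)
    val binLoad = new Array[Long](numReducers)
    // Largest blocks first, each into the currently lightest bin.
    for (i <- blockSizes.indices.sortBy(j => -blockSizes(j))) {
      val bin = binLoad.indices.minBy(b => binLoad(b))
      assignment(i) = bin
      binLoad(bin) += blockSizes(i)
    }
    assignment
  }

  // Apply the repacking only when the (hypothetical) switch is enabled.
  def maybeRepack(conf: SparkConf, sizes: Array[Long], reducers: Int): Option[Array[Int]] =
    if (conf.getBoolean(RepackEnabledKey, defaultValue = false))
      Some(repack(sizes, reducers))
    else None
}
```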
Keywords/Search Tags:localization, Spark SQL, data skew, MDP, dynamic binding