
Research On Data Skew Optimization In Spark Computing Framework

Posted on: 2019-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Zhang
Full Text: PDF
GTID: 2428330590965717
Subject: Computer Science and Technology
Abstract/Summary:
Spark is a memory-based distributed data processing framework capable of handling massive data, and it has become a research focus in the field of big data. In Spark, the Shuffle phase sits between the Map and Reduce phases and transfers the intermediate data from the Map side to the Reduce side. Because the performance of the Shuffle process depends heavily on the data distribution, Spark's default Hash partitioning cannot guarantee load balance in the Reduce phase when the data are skewed, which prolongs the execution time of the Job.

This thesis studies the RDD partitioning strategy and the Shuffle mechanism within the Spark computing framework. Building on the latest research in this field, the Dynamic Re-partition strategy based on the Key distribution (DR) and the Optimization Fragmentation strategy based on a Cost Model (OFCM) are proposed to relieve the slow-Task pressure caused by the imbalanced data distribution on the Reduce side, thereby improving the efficiency of Spark Task processing. Specifically, the main work of this thesis is as follows:

1. To handle the data skew that arises in the Spark Shuffle phase when an unbalanced data distribution concentrates a large amount of data in slow Tasks, a dynamic sampling method is designed. During Task execution, a histogram is used to count the Key frequency distribution on each node, and a global Key frequency distribution histogram is generated after the local histograms are collected. On this basis, the dynamic re-partition strategy based on the Key distribution (DR) is proposed. Finally, the DR algorithm is compared experimentally with Spark's default Hash partitioning, the Fine Partitioning algorithm, and the Balanced-Schedule algorithm. The results show that the strategy reduces the overall execution time of the computing Task and thus improves the execution efficiency of the Spark cluster.

2. On the basis of the DR algorithm, this thesis proposes the OFCM strategy as a further optimization. The cluster size and the computational complexity are used to evaluate the weight of the computation in each RDD partition; large partitions are then divided into multiple small fragments, and an additional copy operation is added when necessary. A cost model is built to balance the re-partitioning cost against the degree of data balance. In the experiments, the OFCM and DR algorithms are compared under varying influence factors. The results show that the OFCM algorithm can effectively relieve the computing pressure on the cluster and solve the data skew problem in heavily skewed application scenarios.

This research shows that although data skew objectively exists in the Spark Shuffle phase, it can be mitigated by detecting the skewed Keys and using an efficient re-partition strategy. The DR and OFCM strategies proposed in this thesis effectively solve the Job delay caused by data skew and significantly improve the efficiency of cluster task execution, which has important theoretical value and practical significance.
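To make the two ideas more concrete, the following Scala sketches illustrate them against the standard Spark API. They are illustrative assumptions rather than the thesis's actual DR or OFCM implementations: the names FrequencyAwarePartitioner and skewAwareSum, the sampling fraction, the greedy placement rule, and the key-salting details are all hypothetical.

    import org.apache.spark.Partitioner

    // A minimal sketch of the DR idea (not the thesis's exact algorithm): a partitioner
    // that takes a sampled Key-frequency histogram and places the heaviest Keys first
    // on the currently least-loaded Reduce partition, instead of relying on hashCode alone.
    class FrequencyAwarePartitioner(override val numPartitions: Int,
                                    keyFrequencies: Seq[(Any, Long)]) extends Partitioner {

      require(numPartitions > 0, "numPartitions must be positive")

      // Greedy placement: the heaviest sampled Keys go to the lightest partition so far;
      // Keys never seen during sampling fall back to plain hash partitioning.
      private val assignment: Map[Any, Int] = {
        val load = Array.fill(numPartitions)(0L)
        keyFrequencies.sortBy(-_._2).map { case (key, freq) =>
          val target = load.zipWithIndex.minBy(_._1)._2
          load(target) += freq
          key -> target
        }.toMap
      }

      override def getPartition(key: Any): Int =
        assignment.getOrElse(key, ((key.hashCode % numPartitions) + numPartitions) % numPartitions)
    }

    // Hypothetical usage: sample a skewed pair RDD, count Key frequencies, then
    // repartition with the frequency-aware partitioner before the Reduce-side stage.
    // val freqs = pairRdd.sample(withReplacement = false, fraction = 0.01).countByKey().toSeq
    // val balanced = pairRdd.partitionBy(new FrequencyAwarePartitioner(64, freqs))

The fragmentation idea behind OFCM can be illustrated, again only as an assumption, with the common key-salting pattern: records of a heavy Key are spread across several fragments and aggregated in two rounds, so that no single Reduce Task receives the whole Key.

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Sketch of fragment-style skew mitigation for a sum aggregation; `heavyKeys` and
    // `fragments` are hypothetical parameters supplied by a preceding skew-detection step.
    def skewAwareSum(pairs: RDD[(String, Long)],
                     heavyKeys: Set[String],
                     fragments: Int): RDD[(String, Long)] = {
      val salted = pairs.map { case (k, v) =>
        // Heavy Keys are split across `fragments` sub-keys; light Keys keep fragment 0.
        val salt = if (heavyKeys.contains(k)) Random.nextInt(fragments) else 0
        ((k, salt), v)
      }
      salted
        .reduceByKey(_ + _)                      // first round: aggregate within each fragment
        .map { case ((k, _), v) => (k, v) }
        .reduceByKey(_ + _)                      // second round: merge the fragments of each Key
    }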
Keywords/Search Tags: Shuffle, data skew, re-partition, cost model, fragment