
A Key-Value Skew Model Based Dynamic Data Partitioning Algorithm In Spark

Posted on: 2020-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Yan
Full Text: PDF
GTID: 2428330572473656
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, data volumes are growing geometrically. A single machine can no longer meet the processing and storage demands of such massive data, so a distributed cluster is typically required. Spark is a high-performance distributed data-processing framework that exploits cluster memory to improve processing efficiency. In a distributed environment, data skew unbalances the data volume across the executors in the cluster: one or a few machines must process very large amounts of data while the other machines receive only a little, so cluster resources are underutilized and the efficiency of a Spark job on skewed data degrades sharply.

This thesis describes the implementation of Spark's distributed data-processing framework and analyzes the root cause of its inefficiency on skewed data. It proposes a key-value skew model that uniformly quantifies the skew of both key frequencies and value sizes; the target function in this model serves as an important index for evaluating the load-balance degree of reduce tasks in Spark.

To address the inefficiency of processing skewed data on the Spark platform, this thesis proposes a dynamic data partitioning algorithm that lets every machine in the cluster process a similar amount of data, achieving load balancing. Depending on the skew degree of the intermediate data produced by the map tasks, different partitioning strategies are applied. When the data is slightly skewed, a hash-based best-fit method performs the allocation. When the data is heavily skewed, the algorithm first converts the heavy skew into slight skew by splitting oversized clusters of key-value tuples from each map task, and then allocates with the hash-based best-fit method. This thesis further proposes a dynamic data
partitioning scheme that combines the reservoir sampling algorithm with the dynamic data partitioning algorithm. First, reservoir sampling estimates the skew degree of the intermediate data. Then, the dynamic data partitioning algorithm pre-partitions the intermediate data to obtain a reference partitioning table, according to which all intermediate data can be allocated evenly. Experiments on three different data-skew scenarios, against three comparison algorithms under different skew degrees, show that the dynamic partitioning scheme effectively handles both skewed-key and skewed-value cases at different skew levels on the Spark platform. The scheme equalizes the amount of data received by each worker node in the cluster, balances the cluster load, and improves the efficiency of Spark jobs.
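Although the abstract gives no code, the scheme it describes can be sketched end to end: reservoir-sample the map output, scale the sample up to estimated cluster sizes, split clusters above a threshold so heavy skew degrades to slight skew, then greedily assign each piece to the currently least-loaded reduce partition. The function names, the splitting threshold, and the `balance_degree` index below are illustrative assumptions, not the thesis's actual implementation or target function.

```python
import random
from collections import Counter, defaultdict

def reservoir_sample(stream, k, rng):
    """Algorithm R: a uniform sample of k items from a stream in one pass."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

def estimate_cluster_sizes(sample, total):
    """Scale sampled key counts up to estimated full-stream cluster sizes."""
    counts = Counter(sample)
    return {key: c * total / len(sample) for key, c in counts.items()}

def best_fit_partition(cluster_sizes, num_partitions, split_threshold):
    """Split oversized clusters into near-equal shares, then assign pieces
    (largest first) to the currently least-loaded partition."""
    pieces = []
    for key, size in cluster_sizes.items():
        n = max(1, -(-int(size) // split_threshold))  # ceil division
        pieces.extend((key, size / n) for _ in range(n))
    loads = [0.0] * num_partitions
    table = defaultdict(list)  # reference partitioning table: key -> partitions
    for key, share in sorted(pieces, key=lambda p: -p[1]):
        target = min(range(num_partitions), key=loads.__getitem__)
        loads[target] += share
        table[key].append(target)
    return dict(table), loads

def balance_degree(loads):
    """A simple load-balance index: max load over mean load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))
```

In this sketch, a heavily skewed key is split into several shares that land on different partitions, so no single reduce task receives the whole oversized cluster; a lightly skewed key stays whole and is simply placed best-fit.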
Keywords/Search Tags: data skew, partitioning algorithm, load balancing, Spark