
Research on Data Skew in Spark Based on an Improved Partition Method

Posted on: 2019-04-22    Degree: Master    Type: Thesis
Country: China    Candidate: Y K Yang    Full Text: PDF
GTID: 2428330563492462    Subject: Computer system architecture
Abstract/Summary:
With the rapid development and wide application of Internet technology, we have entered the era of big data, and the growing demand for big data processing and analysis has driven the development of related technologies. Google's MapReduce programming model is a widely used parallel computing framework, and Hadoop, its open-source implementation, has been studied and applied extensively in the big data field. Spark, a fast and efficient MapReduce-style engine that emerged after Hadoop, has gradually become the mainstream one-stop big data processing platform. Aggregation and join queries are important and common database operations, yet the MapReduce framework does not support joins well, and many existing aggregation and join algorithms for MapReduce handle data skew poorly. Real-world data distributions are often skewed, and skewed data causes large load differences among MapReduce tasks, seriously reducing the cluster's resource utilization.

This thesis first gives a brief introduction to common algorithms for aggregation and join queries and analyzes the causes and effects of data skew on the Spark platform. It then presents a Skew Adaptable Partition (SAP) algorithm for the data skew problem in aggregation queries, built as an improvement on the Simple Range Partition algorithm. The idea of SAP is to estimate the data distribution by sampling, compute the network and disk I/O cost across the cluster, and then treat large and small key clusters separately according to that cost: each large cluster is assigned its own reduce partition, while small clusters are combined before being assigned to the remaining partitions, which improves the load balance of the partitioning (see the first sketch below).

Next, the thesis proposes a Skew Adaptable Join (SAJ) algorithm for the data skew problem in two-way equi-joins, based on a combination of sampling and key splitting. SAJ likewise estimates the data distribution by sampling and computes the I/O cost of each key cluster. Large clusters are then split according to their I/O cost, or handled by key replication, and assigned to the partitions with the smallest load that can hold the split fragments; small clusters are combined before being assigned to the remaining partitions. The result is a better data repartitioning algorithm that improves the efficiency of two-way equi-join operations (see the second sketch below).

Finally, a series of comparative experiments on the two algorithms shows that they achieve better load balance and performance than existing approaches when processing skewed data.
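To make the SAP idea concrete, here is a minimal, hypothetical sketch of a skew-adaptable partitioner in Scala, Spark's native language. It assumes per-key cluster sizes estimated by a sampling pass (`sampledSizes`) and a size `threshold` separating big from small clusters; these names and the greedy packing of small clusters are illustrative assumptions, not the thesis's actual implementation, and the network/disk I/O cost model is reduced here to raw sampled sizes.

```scala
import org.apache.spark.Partitioner
import scala.collection.mutable

// Sketch of a SAP-style partitioner (hypothetical names throughout):
// big clusters get dedicated reduce partitions, small clusters are
// combined onto the remaining partitions.
class SapPartitioner(numParts: Int,
                     sampledSizes: Map[Any, Long],
                     threshold: Long) extends Partitioner {

  // Each cluster above the threshold gets its own partition.
  private val bigKeys: Map[Any, Int] = {
    val bigs = sampledSizes.filter(_._2 > threshold).keys.toSeq
    require(bigs.size < numParts, "need partitions left over for small clusters")
    bigs.zipWithIndex.toMap
  }

  // Small clusters are packed greedily: largest first, each onto the
  // currently lightest of the remaining partitions.
  private val smallKeys: Map[Any, Int] = {
    val loads = mutable.Map((bigKeys.size until numParts).map(p => p -> 0L): _*)
    val assign = mutable.Map.empty[Any, Int]
    for ((k, sz) <- sampledSizes.toSeq.filter(_._2 <= threshold).sortBy(-_._2)) {
      val p = loads.minBy(_._2)._1
      assign(k) = p
      loads(p) += sz
    }
    assign.toMap
  }

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int =
    bigKeys.getOrElse(key, smallKeys.getOrElse(key, {
      // Keys the sample missed fall back to hashing over the small range.
      val rest = numParts - bigKeys.size
      val h = key.hashCode % rest
      bigKeys.size + (if (h < 0) h + rest else h)
    }))
}
```

Such a partitioner could be plugged into an aggregation, for example `pairs.reduceByKey(new SapPartitioner(64, sizes, cutoff), _ + _)`, so that reduce-side load follows the sampled distribution instead of default hash partitioning.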
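The split-and-replicate step of SAJ can be sketched with standard key salting: rows of a skewed key on one side are scattered across `fanout` sub-keys, and the other side's matching rows are replicated once per sub-key, so no single reducer receives the whole hot key. This is a simplified stand-in for the thesis's cost-based placement; `skewedKeys` and `fanout` are assumed to come from the sampling and I/O-cost step, and all names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import scala.util.Random

// Sketch of a SAJ-style two-way equi-join (hypothetical names): skewed
// keys are split on the left side and replicated on the right side.
def skewAwareJoin[K: ClassTag, V: ClassTag, W: ClassTag](
    left: RDD[(K, V)],
    right: RDD[(K, W)],
    skewedKeys: Set[K],
    fanout: Int): RDD[(K, (V, W))] = {

  // Scatter each skewed key's rows across `fanout` salted sub-keys.
  val salted: RDD[((K, Int), V)] = left.map { case (k, v) =>
    val salt = if (skewedKeys.contains(k)) Random.nextInt(fanout) else 0
    ((k, salt), v)
  }

  // Replicate the matching rows once per sub-key so every fragment of a
  // skewed key still finds all of its join partners.
  val replicated: RDD[((K, Int), W)] = right.flatMap { case (k, w) =>
    val copies = if (skewedKeys.contains(k)) fanout else 1
    (0 until copies).map(salt => ((k, salt), w))
  }

  // Join on the salted key, then strip the salt off the result.
  salted.join(replicated).map { case ((k, _), vw) => (k, vw) }
}
```

Replication multiplies the other side's skewed rows by `fanout`, so the split factor trades shuffle volume against reducer balance, which is exactly the trade-off the I/O cost estimate is meant to settle.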
Keywords/Search Tags: Big Data, Spark, Data Skew, Load Balancing