The Optimization Research Of Spark Load Balancing And Big Table Equal Join

Posted on: 2020-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2428330575475784
Subject: Computer software and theory
Abstract/Summary:
Spark is one of the mainstream big data computing frameworks. It processes data quickly and is easy to use, but it also has some defects. For example, when a program runs on Spark, the load across computing nodes may be unbalanced, and joining two large tables is inefficient and incurs heavy network communication. This paper therefore analyzes the load balancing strategy and the large table equal join method of the Spark platform, and optimizes each of them to improve the data processing performance and efficiency of the cluster. The main contents are as follows:

(1) Research on the optimization of the Spark load balancing strategy and algorithm. The load balancing strategy of a Spark cluster neglects the computing capability and resource utilization of each node, which makes load imbalance likely. To address this, the paper proposes an optimized Spark load balancing strategy that assigns execution nodes to tasks differently depending on the Stage. For a Stage that contains the source RDD, a task execution node allocation algorithm based on the genetic algorithm and particle swarm optimization (GA-PSO) is designed. For a non-source Stage, the best execution location is determined during task allocation from the ancestor Stage along the narrow dependency (Narrow dep). The experimental results show that, compared with Spark's own load balancing strategy, the improved strategy achieves a significant improvement in both load deviation and task completion time.

(2) Research on the optimization of the Spark big table equal join. To address the large network transmission overhead that Spark incurs when handling big table equal joins, the paper proposes a Spark large table equal join optimization method. First, the method introduces a Split Compressed Bloom Filter (SCBF) algorithm, which is suitable for filtering data sets of unknown data volume. Then a Maxdiff histogram is used to analyze the data distribution of the tables being joined and to identify the skewed data in the data set. The RDD is split according to these statistics, each part is joined with a suitable join algorithm, and the sub-results are combined to obtain the final result. The experimental results show that the proposed method has clear advantages over the original Spark method in the amount of data written and read during shuffle and in task running time.

Finally, both algorithms are verified in a self-built Spark cluster experimental environment, and the performance of the improved algorithms and the original algorithms is compared through several experiments. The results show that the two improved algorithms improve the load balancing performance of the Spark cluster and the efficiency of the big table equal join, which shortens the task execution time of the cluster and improves its resource utilization.
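The filtering idea in part (2) can be illustrated with a minimal sketch. It assumes a plain Bloom filter rather than the thesis's SCBF, whose details the abstract does not give, and it uses in-memory Python lists of (key, value) pairs in place of Spark RDDs. The names `BloomFilter` and `filtered_equi_join` and the default sizes are hypothetical choices for illustration only.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter (an assumption; not the thesis's SCBF variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def filtered_equi_join(left, right):
    """Equi-join two lists of (key, value) pairs.

    A Bloom filter built on the keys of `left` drops rows of `right`
    that cannot match, standing in for data removed before a shuffle.
    Returns the joined rows and the number of `right` rows that survived.
    """
    bloom = BloomFilter()
    index = {}
    for key, value in left:
        bloom.add(key)
        index.setdefault(key, []).append(value)

    survivors = [(k, v) for k, v in right if bloom.might_contain(k)]
    # False positives may survive the filter; the exact lookup removes them.
    joined = [(k, lv, rv) for k, rv in survivors for lv in index.get(k, [])]
    return joined, len(survivors)
```

Because a Bloom filter has no false negatives, the join result is unchanged by the pre-filter. Only the number of rows carried into the join step shrinks, which is the quantity the abstract reports as shuffle write/read volume.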
Keywords/Search Tags: Spark cluster, Load balancing, Big table equal join, Intelligence algorithm, Histogram