The Optimization Research Of Spark Load Balancing And Big Table Equal Join

Posted on:2020-03-31

Degree:Master

Type:Thesis

Country:China

Candidate:L Zhang

Full Text:PDF

GTID:2428330575475784

Subject:Computer software and theory

Abstract/Summary:

Spark is one of the mainstream big data computing frameworks. It processes data quickly and is easy to use, but it also has shortcomings. For example, when a program runs on Spark, the load across computing nodes may be unbalanced, and joining two large tables is inefficient because it incurs heavy network communication. This paper therefore analyzes the load balancing strategy and the large-table equal join (equi-join) method of the Spark platform, and optimizes each of them to improve the data processing performance and efficiency of the cluster. The main contents are as follows:

(1) Research on the optimization of the Spark load balancing strategy and algorithm. The load balancing strategy of a Spark cluster neglects the computing capability and resource utilization of each node, which makes load imbalance likely. To address this, the paper proposes an optimized Spark load balancing strategy that assigns execution nodes to tasks differently depending on the Stage. For a Stage that contains the source RDD, a task execution node allocation algorithm based on the genetic algorithm and particle swarm optimization (GA-PSO) is designed. For a non-source Stage, the best execution location is determined during task allocation by the ancestor Stage of the narrow dependency. The experimental results show that, compared with Spark's own load balancing strategy, the improved strategy achieves a significant improvement in both load deviation and task completion time.

(2) Research on the optimization of Spark big table equal join. Spark incurs a large network transmission overhead when it joins big tables on equal keys. To address this, the paper proposes a Spark large-table equal join optimization method. First, the method introduces a Split Compressed Bloom Filter (SCBF) algorithm, which is suitable for filtering data sets whose data volume is unknown. Then a Maxdiff histogram is used to statistically analyze the data distribution of the tables being joined and to identify the skewed data in the data set. According to the statistical results, the RDD is split, each part is joined with a suitable join algorithm, and the sub-results are combined to obtain the final result. The experimental results show that, compared with the original Spark method, the proposed method has clear advantages in the amount of data in shuffle write/read and in task running time.

Finally, both algorithms are verified in a self-built Spark cluster experimental environment, and the improved algorithms are compared with the original algorithms through several experiments. The results show that the two improved algorithms improve the load balancing performance of the Spark cluster and the efficiency of big table equal join, which shortens the task execution time of the Spark cluster and improves its resource utilization.
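
The filtering step in part (2) can be illustrated with a minimal sketch. It uses a plain Bloom filter in single-process Python, not the SCBF algorithm proposed in the thesis and not Spark itself; the class `BloomFilter`, the function `filtered_equi_join`, and the sizes `num_bits` and `num_hashes` are hypothetical names and values chosen for illustration. The idea shown is only that rows whose key cannot match the other table are dropped before the join, which in a distributed setting is what would reduce the data sent through the shuffle.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Illustrative sizes only; not taken from the thesis.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_equi_join(left, right):
    """Equi-join two lists of (key, value) pairs, pre-filtering `left`."""
    bloom = BloomFilter()
    for key, _ in right:
        bloom.add(key)
    # Rows dropped here would not need to be shuffled in a distributed join.
    survivors = [(k, v) for k, v in left if bloom.might_contain(k)]
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    # The exact hash join removes any false positives the filter let through.
    joined = [(k, lv, rv) for k, lv in survivors for rv in index.get(k, [])]
    return joined, len(survivors)


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
    right = [(2, "x"), (4, "y")]
    rows, kept = filtered_equi_join(left, right)
    print(rows, kept)
```

Because a Bloom filter never produces false negatives, the pre-filter cannot remove a row that belongs in the result; false positives only cost some extra rows that the exact join discards afterwards.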

Keywords/Search Tags:

Spark cluster, Load balancing, Big table equal join, Intelligence algorithm, Histogram

Related items

1	Research On Cardinalities Estimation Of Two Table For Join Operator Based On Spark SQL Platform
2	The Application Of Linux Cluster System Based On The Load Balancing Algorithm In Webgis
3	Research And Improvement Of Load Balancing Algorithm For Web Cluster
4	An Intermediate Data Placement Algorithm For Load Balancing In Spark Computing Environment
5	Research On Cluster Load Balancing Algorithm Based On LVS Database
6	Research And Implementation Of LVS Cluster Load Balancing Scheduling Algorithm Based On PSO-GA
7	Research And Improvement Of Server Cluster Load Balancing Strategy Based On Nginx
8	Load Balancing And Optimization Management Algorithm For GPU Cluster Cloud Rendering Platform
9	Analysis And Research Of Load Balancing Algorithm Based On Linux Cluster System
10	The Research And Improvement Of Load Balancing For Web Cluster Based On Nginx