
Research On Equi-Join Optimization Algorithms On Spark SQL

Posted on: 2020-05-26 | Degree: Master | Type: Thesis
Country: China | Candidate: S H Li
GTID: 2428330578957178 | Subject: Computer Science and Technology
Abstract/Summary:
The development of science and technology and the popularity of the Internet have driven the arrival of the era of big data. Huge amounts of data are generated around the world every day, and the units used to measure data have grown from Byte, KB, and MB to PB, EB, and even YB and BB. With so much data, big data analysis has become a research hotspot, and big data platforms led by Hadoop and Spark have emerged. Spark SQL is the Spark module for processing structured data. Its table join operations perform poorly, yet equi-joins over large data sets are used frequently in big data analysis. This thesis therefore optimizes the equi-join algorithm in Spark SQL.

To address the problem that existing equi-join algorithms cannot be applied across different scenarios, an equi-join optimization algorithm based on an Extended Partial Bloom Filter is proposed, named EPBF Join. The optimization is mainly reflected in two aspects. First, the Partial Bloom Filter is extended to support parallel computation, which reduces the time spent in the data filtering stage and improves overall join performance. Second, EPBF Join automatically changes the number of bit arrays according to the data size, so it can handle the case where the data size is unknown. EPBF Join is therefore applicable whether the data size is known or unknown.

This thesis also addresses the low performance of equi-join operations under data skew, and proposes an equi-join optimization algorithm that estimates data skew based on a Space-Code Bloom Filter, named SCBF-ESD Join. Its optimizations and innovations are mainly reflected in four aspects. First, SCBF-ESD Join introduces a new Space-Code Bloom Filter, which not only performs data filtering but also obtains the frequency of join attributes, making it convenient to calculate the degree of data skew. Second, SCBF-ESD Join optimizes the equi-join process by adding a stage that judges data skew and a stage that reduces data skew, so that it can be applied whether or not the data is skewed. Third, in the skew-judgment stage, SCBF-ESD Join proposes a new strategy for calculating the degree of data skew, which computes the skew degree of the filtered data from the frequency of valid join attributes. Fourth, in the skew-reduction stage, a strategy combining random prefix addition with consistent hashing is proposed, which disperses repeated attributes, re-partitions the data, and reduces the performance impact caused by data skew.

The two optimization algorithms are analyzed and validated both in theory and by experiment. First, the theoretical validity of the two algorithms is verified by cost analysis, and then a series of comparative experiments is carried out. The experimental results show that the EPBF Join algorithm can join efficiently regardless of the data size. The SCBF-ESD Join algorithm can estimate the degree of data skew and adopt different operations accordingly to achieve better join performance. The two strategies proposed in SCBF-ESD Join are also validated.
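The data filtering stage that both algorithms build on can be illustrated with a minimal sketch. The code below uses a plain, single-array Bloom filter in pure Python. It is not the Extended Partial Bloom Filter or the Space-Code Bloom Filter from the thesis, and it does not use Spark. The class name `BloomFilter`, the function `bloom_filtered_join`, and the parameters `num_bits` and `num_hashes` are hypothetical names chosen for illustration. The idea shown is only the general one: build a filter from the join keys of the smaller table, discard rows of the larger table whose keys cannot match, then join the survivors.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Plain Bloom filter over one bit array (illustrative, not EPBF or SCBF)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_filtered_join(small, large, bloom):
    """Equi-join two lists of (key, value) pairs, pre-filtering `large`.

    Returns (joined_rows, surviving_rows_of_large).
    """
    # Build the filter from the join keys of the smaller table.
    for key, _ in small:
        bloom.add(key)

    # Filtering stage: drop rows whose key cannot appear in `small`.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]

    # Join stage: an exact hash join, so Bloom false positives
    # cost extra work but never produce wrong output rows.
    index = defaultdict(list)
    for k, v in small:
        index[k].append(v)
    joined = [(k, sv, lv) for k, lv in survivors for sv in index.get(k, [])]
    return joined, survivors


small_table = [(1, "a"), (2, "b")]
large_table = [(1, "x"), (2, "y"), (3, "z"), (1, "w")]
joined_rows, surviving_rows = bloom_filtered_join(
    small_table, large_table, BloomFilter()
)
```

Because a Bloom filter has no false negatives, every row of `large_table` with a matching key survives the filtering stage. A non-matching row can occasionally survive as a false positive, which is why the final join still checks keys exactly.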
Keywords/Search Tags:Spark SQL, Equi-join, Data Skew, Extended Partial Bloom Filter, Space-Code Bloom Filter