Research And Application Of SQL Join Optimization Based On Spark

Posted on:2018-11-23

Degree:Master

Type:Thesis

Country:China

Candidate:S Shao

Full Text:PDF

GTID:2348330512993104

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet and information technology, data volumes in every industry are growing quickly, and big data platforms have emerged to analyze and exploit this data. Hadoop can handle general large-scale data analysis and processing, but as the volume of information grows and the demand for real-time processing rises, its performance is no longer sufficient. Spark, a newer distributed processing platform, is attracting increasing attention and adoption. Spark SQL is the Spark module for processing structured data and a core component of Spark. Because table joins (the Spark Join operation) perform poorly in Spark SQL, the overall performance of Spark suffers considerably. This thesis therefore studies join optimization in Spark SQL.

First, building on strategies from other researchers and on the Broadcast Join and Hash Join provided by Spark, we propose CBF Join, a Spark Join optimization scheme based on the Compressed Bloom Filter and aimed at equi-joins between large tables. Its main idea is to filter out, in advance, most of the records that do not satisfy the join condition, which reduces the amount of data shuffled during the join and thereby improves the performance of large-table equi-joins. Second, after analyzing the limitations of CBF Join, we propose SBF Join, a Spark Join optimization scheme based on the Split Bloom Filter. It generates bit arrays dynamically to compensate for the weakness of CBF Join when the size of the tables being joined is unknown. Third, we study the degradation of Spark join performance caused by data skew and propose Skew Join, a Spark Join optimization scheme for skewed data. In this scheme, the join attributes affected by skew are derived from a histogram of the data and handled separately from normal join attributes, which reduces running time and improves the efficiency of equi-joins between large tables.

Finally, we validate these results with four experiments that compare the volume of data in the Shuffle stage and the running time. The experiments show that the Spark distributed computing framework outperforms the Hadoop distributed computing framework; that CBF Join outperforms Spark's Hash Join in the common scenario; that SBF Join outperforms Hash Join when the size of the tables being joined is unknown; and that Skew Join outperforms both Hash Join and CBF Join when the join attributes are skewed. The experimental results indicate that the three optimization schemes proposed in this thesis improve the efficiency of large-scale equi-joins in Spark SQL.
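The pre-filtering idea behind a Bloom-filter-based join can be illustrated with a minimal single-process sketch in Python. It is not the thesis's implementation: it uses a plain (uncompressed) Bloom filter rather than the Compressed or Split variants, it runs on in-memory lists instead of Spark, and the names `BloomFilter`, `bloom_prefiltered_join`, `num_bits`, and `num_hashes` are hypothetical. The choice to build the filter from the left table is also an assumption made for illustration.

```python
import hashlib


class BloomFilter:
    """Plain (uncompressed) Bloom filter over a fixed-size bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from two 64-bit halves of one digest
        # (double hashing); forcing h2 odd keeps it non-zero.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def bloom_prefiltered_join(left, right, num_bits=1024, num_hashes=3):
    """Equi-join two lists of (key, value) pairs, pre-filtering `right`.

    Returns the joined rows and the number of `right` rows that survived
    the filter (a stand-in for the amount of data that would be shuffled).
    """
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in left:
        bloom.add(key)

    # Rows rejected here cannot match any key in `left`.
    candidates = [(k, v) for k, v in right if bloom.might_contain(k)]

    # Ordinary hash join on the survivors; false positives drop out here.
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    joined = [(k, lv, rv) for k, rv in candidates for lv in table.get(k, [])]
    return joined, len(candidates)
```

Because a Bloom filter never produces false negatives, the filtered join returns the same rows as an unfiltered hash join; the possible gain is that fewer rows of the second table have to be moved before the join.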

Keywords/Search Tags:

Spark SQL, Compressed Bloom Filter, Split Bloom Filter, Histogram

Related items

1	Privacy Preserved Bloom Filter And Key-value Based Bloom Filter
2	Research And Application Of Data Deduplication Technology Based On Bloom Filter
3	Research On Equi-Join Optimization Algorithms On Spark SQL
4	Multi-Bloom-Filter Query Algorithms And Their Applications
5	Study And Implementation Of SPARK SQL Query Optimization
6	Research And Application Of Bloom Filter In Duplicated Webpages Deletion
7	Researches And Applications On Efficient Bloom Filter For Big Data
8	Research And Application Of Multi-pattern Matching Engine Based On Bloom Filter
9	Research And Application Of Multi-Pattern Matching Engine Based On Bloom Filter
10	The Design Of Bloom Filter Algorithm For Key-value Storage