Research And Application Of SQL Join Optimization Based On Spark

Posted on: 2018-11-23  Degree: Master  Type: Thesis
Country: China  Candidate: S Shao  Full Text: PDF
GTID: 2348330512993104  Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet and information technology, data volumes in every industry are growing quickly, and big data platforms have emerged to analyze and exploit that data. Hadoop can handle general large-scale data analysis and processing, but as information volumes explode and the demand for real-time processing rises, its performance is no longer sufficient. Spark, a newer distributed processing platform, is attracting growing attention and adoption. Spark SQL is Spark's module for processing structured data and one of its core components. The poor performance of table joins in Spark SQL (the Spark Join operation) seriously affects the overall performance of Spark, so this thesis studies join optimization for Spark SQL.

First, building on strategies proposed by other researchers and on the Broadcast Join and Hash Join provided by Spark, we propose CBF Join, a Spark Join optimization scheme based on the Compressed Bloom Filter and aimed at equi-joins between large tables. Its main idea is to filter out, in advance, most of the records that cannot satisfy the join condition, which reduces the amount of Shuffle data during the join and thereby improves the performance of large-table equi-joins. Second, after analyzing the limitations of CBF Join, we propose SBF Join, a Spark Join optimization scheme based on the Split Bloom Filter. It generates bit arrays dynamically to compensate for the weakness of CBF Join when the size of the tables to be joined is unknown. Third, we study the degradation of Spark join performance caused by data skew and propose Skew Join, an optimization scheme for skewed data. In this scheme, the skewed join-attribute values are identified with a histogram and processed separately from the non-skewed values, which reduces running time and improves the efficiency of equi-joins between large tables.

Finally, we validated these results in four experiments that compare Shuffle-stage data volume and running time. The experiments show that the Spark distributed computing framework outperforms the Hadoop distributed computing framework; that CBF Join is superior to Spark's Hash Join in the common scenario; that SBF Join performs better than Hash Join when the size of the tables to be joined is unknown; and that Skew Join performs better than both Hash Join and CBF Join when the join attributes are skewed. Overall, the results show that the three optimization schemes proposed in this thesis improve the efficiency of large-scale equi-join operations in Spark SQL.
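The pre-filtering idea behind CBF Join can be illustrated with a small single-process sketch. This is not the thesis's implementation: it uses a plain, uncompressed Bloom filter rather than a Compressed Bloom Filter, it runs on in-memory Python lists rather than Spark, and the names `BloomFilter`, `bloom_filtered_join`, `num_bits`, and `num_hashes` are hypothetical. It only shows how a filter built on one table's join keys can discard non-matching rows of the other table before the join, which in a distributed setting would correspond to rows that never enter the Shuffle.

```python
import hashlib


class BloomFilter:
    """Plain (uncompressed) Bloom filter over a fixed-size bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def bloom_filtered_join(left, right, num_bits=1024, num_hashes=3):
    """Equi-join two lists of (key, value) rows.

    Rows of `right` are pre-filtered with a Bloom filter built on the keys
    of `left`. Returns the joined rows and the number of `right` rows that
    survived the filter.
    """
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in left:
        bloom.add(key)

    # Rows dropped here would not need to be shuffled in a distributed join.
    survivors = [row for row in right if bloom.might_contain(row[0])]

    # Ordinary hash join on the survivors; the exact key lookup removes
    # any false positives the filter let through.
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    joined = [(k, lv, rv) for k, rv in survivors for lv in table.get(k, [])]
    return joined, len(survivors)


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(2, "x"), (3, "y"), (3, "z")] + [(k, "noise") for k in range(100, 200)]
joined, survivors = bloom_filtered_join(left, right)
print(sorted(joined), survivors, "of", len(right), "right-side rows kept")
```

The number of surviving rows depends on the false-positive rate, which is governed by the assumed `num_bits` and `num_hashes`; the join result itself is exact because false positives are eliminated by the final key lookup.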
Keywords/Search Tags: Spark SQL, Compressed Bloom Filter, Split Bloom Filter, Histogram