
Research on Cardinality Estimation for Two-Table Join Operators on the Spark SQL Platform

Posted on: 2019-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: S J Ye
Full Text: PDF
GTID: 2428330590950396
Subject: Software engineering
Abstract/Summary:
In recent years, the mobile internet and Internet of Things industries have experienced explosive growth. In fields such as communications, logistics, finance, the industrial Internet of Things, and the web, a wide variety of end devices generate large amounts of structured data at all times. As this massive data accumulates, traditional software frameworks can no longer meet the needs of big data applications, which has driven the emergence of HDFS, MapReduce, HBase, Hive, and many other Hadoop technologies. Spark is an open-source parallel computing framework that builds on the Hadoop MapReduce model. Because its intermediate results can be kept in memory rather than written to and read back from HDFS, as in other Hadoop systems, Spark achieves higher computational efficiency for a wide range of algorithms and has been widely adopted.

Spark SQL is an important module of the Spark system. It provides the DataFrame API for performing relational operations on internal and external data sources, and it includes the Catalyst optimizer, which allows special optimization strategies to be added for applications such as machine learning. Catalyst applies rule-based and cost-based optimizations to the logical and physical plans of Spark SQL statements. In the cost-based optimization strategy, statistics about the data must be used to estimate the cost of each operation, so the accuracy of these statistics directly determines the quality of the resulting plan.

The join is one of the most complex relational operations in Spark SQL, and the join cardinality is the most difficult quantity to estimate in join cost optimization. Spark SQL's current join cardinality estimation method works well when the data is uniformly distributed. When the distribution is skewed, however, it fails to capture the characteristics of the data, and its estimates are often off from the true value by orders of magnitude, which greatly degrades the outcome of cost optimization.

To address the low accuracy of join cardinality estimation in Spark SQL Catalyst, this thesis designs a sliding-window sampling strategy over the history of join operations on a data source, extracts features from the samples, and obtains more accurate join cardinality estimates using a multi-layer perceptron and polynomial regression, respectively. Finally, two types of synthetic datasets and a real dataset from a production IoT platform are used to evaluate the two estimation algorithms. Experiments show that, on the real dataset, the two algorithms improve the relative error rate by factors of 6.03 and 7.17, respectively, compared with Spark SQL's default estimator.
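As background for the cost-based optimization discussed above, the sketch below shows how Spark SQL's CBO is enabled and fed the table statistics on which its join cardinality estimates rely. This is a minimal sketch, not the thesis's method; it assumes Spark 2.2 or later with a persistent catalog, and the tables `orders` and `customers` with the column `customer_id` are hypothetical names used only for illustration.

```python
# Minimal sketch: enabling Spark SQL's cost-based optimizer (CBO) and
# collecting the statistics its join cardinality estimates depend on.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cbo-sketch")
         .config("spark.sql.cbo.enabled", "true")              # turn on cost-based optimization
         .config("spark.sql.cbo.joinReorder.enabled", "true")  # reorder joins by estimated cost
         .enableHiveSupport()
         .getOrCreate())

# The CBO reads table- and column-level statistics gathered by ANALYZE TABLE;
# 'orders' and 'customers' are hypothetical tables used for illustration.
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS customer_id")
spark.sql("ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS customer_id")

# EXPLAIN COST (Spark 3.0+ syntax) prints the optimizer's row-count
# estimates, including its cardinality estimate for the join below.
spark.sql("""
    EXPLAIN COST
    SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id
""").show(truncate=False)
```

When the column statistics are stale or the key distribution is skewed, the estimate printed here can deviate from the true join size by orders of magnitude, which is the problem the thesis targets.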
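The abstract does not specify the thesis's exact feature set or sampler, so the following is only an illustrative sketch of the general shape of a learned estimator: polynomial regression (one of the two model families named above, here via scikit-learn) fitted to features extracted from sampled join history. All feature names and numbers below are invented toy values.

```python
# Illustrative sketch of a regression-based join cardinality estimator.
# Feature layout (hypothetical, not the thesis's actual feature set):
# [rows_left, rows_right, distinct_keys_left, distinct_keys_right]
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Each training row corresponds to one observed join from the sampled
# history; the targets are the true cardinalities measured at run time.
X_train = np.array([
    [1000,  5000,  100,  90],
    [2000, 10000,  150, 140],
    [8000, 40000,  400, 380],
])
y_train = np.array([48000, 135000, 810000])  # observed join cardinalities (toy data)

# Degree-2 polynomial regression: expand the features into pairwise
# products, then fit an ordinary least-squares model on top.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)

# Estimate the cardinality of an unseen join from the same feature vector.
print(model.predict([[4000, 20000, 250, 240]]))
```

Swapping the pipeline for `sklearn.neural_network.MLPRegressor` on the same features would give a multi-layer perceptron variant analogous to the thesis's other model family.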
Keywords/Search Tags: Spark SQL, Catalyst, join cardinality, multilayer perceptron, polynomial regression