
The Optimization Of Spark SQL Based On Cost

Posted on: 2019-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: X Lian
Full Text: PDF
GTID: 2428330590965737
Subject: Computer Science and Technology
Abstract/Summary:
In the era of big data, mining valuable information from massive datasets and processing it in a timely manner has become an urgent need across industries. Spark SQL is a distributed query component built on Apache Spark. By exploiting Spark's in-memory computing, Spark SQL improves retrieval performance in massive data processing scenarios, and it has been widely applied in data cleaning, data mining, and log analysis.

Query optimization is the core of Spark SQL. In a distributed query system, the consumption of computing, memory, network, and disk I/O resources depends on how Join operations are implemented and on the join execution order, so Join optimization is key to improving query performance. The latest version of Spark SQL supports both rule-based and cost-based optimization: the query optimization strategy can reorder the execution path, and the caching strategy can reduce the cost of network transmission and disk operations, thereby improving query performance. However, the cost-based optimization in Spark SQL does not fully account for the characteristics of Spark's in-memory computing, and its caching strategy is relatively limited. To address these problems, this study focuses on optimization strategies for Spark SQL. The main contributions are as follows:

1. A cost estimation model is proposed for each implementation of the Join operator, covering time/space complexity and I/O cost. The model analyzes memory usage and data-spill behavior during Spark SQL execution, and on this basis a physical plan generation strategy and an optimal physical plan selection strategy are proposed. Experimental results show that, compared with the latest Spark SQL platform, the proposed query optimization strategy improves both query performance and system resource utilization.

2. Because Spark SQL cannot automatically cache valuable data, cache space is underutilized. This study proposes a cost-based automatic caching strategy that analyzes cache read and write costs according to the characteristics of in-memory columnar storage. The strategy was verified experimentally: results on the TPC-DS benchmark show that it effectively identifies valuable tables, and caching them in memory improves both query performance and system resource utilization.

Overall, the research shows that a cost-based query optimization strategy that incorporates the characteristics of Spark's in-memory computing can improve both system resource utilization and query performance.
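The abstract does not give the thesis's cost formulas, but the idea behind contribution 1 can be sketched as follows: estimate a cost for each Join implementation from table statistics, then pick the cheapest physical plan. All formulas, constants, and function names below are hypothetical illustrations, not the thesis's model or Spark's actual optimizer.

```python
import math

# Hypothetical per-row cost constants (not from the thesis): NET for network
# transfer, CPU for per-row processing, SHUFFLE for repartitioning by join key.

def broadcast_hash_join_cost(left_rows, right_rows, num_executors=100,
                             net=1.0, cpu=0.1):
    """Ship the smaller table to every executor, then probe a hash table."""
    small, large = min(left_rows, right_rows), max(left_rows, right_rows)
    # network: broadcast the small table to each executor;
    # cpu: one hash-table probe per row of the large table
    return small * num_executors * net + large * cpu

def sort_merge_join_cost(left_rows, right_rows, shuffle=0.5, cpu=0.2):
    """Shuffle both sides by join key, sort each side, then merge."""
    total = left_rows + right_rows
    # shuffle both inputs, plus an O(n log n) sort term
    return total * shuffle + total * math.log2(max(total, 2)) * cpu

def choose_join(left_rows, right_rows):
    """Return the name of the cheaper Join implementation under this model."""
    costs = {
        "BroadcastHashJoin": broadcast_hash_join_cost(left_rows, right_rows),
        "SortMergeJoin": sort_merge_join_cost(left_rows, right_rows),
    }
    return min(costs, key=costs.get)
```

Under these illustrative constants, joining a large table with a small dimension table favors a broadcast join (broadcasting is cheap), while joining two large tables favors a sort-merge join (broadcasting either side to every executor dominates the cost), which mirrors the trade-off a cost-based optimizer must capture.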
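Contribution 2 can be illustrated the same way: score each table by the estimated time saved per byte of cache space, and greedily cache the best tables within a memory budget. The scoring formula and the sample table statistics below are hypothetical, not the thesis's model.

```python
def cache_benefit(scan_cost, cache_read_cost, cache_write_cost, accesses):
    """Net time saved by caching a table that will be read `accesses` times:
    each access pays the (cheaper) cache read instead of a full scan, minus
    the one-time cost of writing the table into the columnar cache."""
    return accesses * (scan_cost - cache_read_cost) - cache_write_cost

def select_tables(tables, budget_bytes):
    """Greedily choose tables to cache, best benefit-per-byte first.

    tables: dict mapping table name to a tuple
            (size_bytes, scan_cost, cache_read_cost, cache_write_cost, accesses)
    """
    scored = []
    for name, (size, scan, read, write, acc) in tables.items():
        benefit = cache_benefit(scan, read, write, acc)
        if benefit > 0:                       # never cache at a net loss
            scored.append((benefit / size, size, name))
    scored.sort(reverse=True)                 # highest benefit-per-byte first
    chosen, used = [], 0
    for _, size, name in scored:
        if used + size <= budget_bytes:       # skip tables that do not fit
            chosen.append(name)
            used += size
    return chosen

# Hypothetical statistics for three tables (sizes in bytes, costs in ms):
tables = {
    "store_sales": (80, 10.0, 1.0, 5.0, 10),  # large fact table, reused often
    "date_dim":    (10,  2.0, 0.5, 1.0, 20),  # small dimension, reused heavily
    "rare":        (50,  5.0, 1.0, 2.0,  1),  # barely worth caching
}
```

With a 60-byte budget this picks `date_dim` first (highest benefit per byte) and skips `store_sales` because it no longer fits, illustrating why benefit must be normalized by cache-space consumption rather than ranked by raw benefit.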
Keywords/Search Tags:Spark SQL, Cost Optimization, Join Operator, Physical Plan, Automatic Caching Strategy