
Research On Cost-based Query Optimization For Spark SQL

Posted on: 2017-03-16    Degree: Master    Type: Thesis
Country: China    Candidate: C. L. Liu    Full Text: PDF
GTID: 2308330485485026    Subject: Computer application technology
Abstract/Summary:
As a next-generation general-purpose big data computing platform, Spark attracts a growing number of researchers and enterprises with its ability to process data quickly. Spark SQL, Spark's structured-data analysis component, is used by more and more organizations and enterprises to analyze their structured data and mine valuable information from it. However, compared with traditional database systems and MapReduce-based query engines, Spark SQL lacks principled cost-based query optimization. Drawing on techniques from database systems and MapReduce-based query optimization, this thesis therefore proposes a cost-based query optimization prototype tailored to Spark that improves the efficiency of query processing.

This thesis focuses on building such a prototype. Unlike MapReduce, Spark can cache intermediate results in memory to reduce disk I/O cost. The thesis defines a cost model for each of Spark's physical operators corresponding to the common relational operators, i.e., select, where, and group-by. For Spark's two implementations of equi-join, Shuffle Join and Broadcast Join, separate cost models are defined.

The thesis surveys query optimization in database systems and MapReduce-based query engines, with an in-depth study of cost-based query optimization techniques. To estimate the cost of a query plan more accurately, equi-depth histograms are adopted: the sizes of intermediate results are estimated from these histograms. The prototype uses the intermediate-result estimates to determine the join order of a query, and the cost model of each physical operator builds on the intermediate-result estimates and the characteristics of Spark. The prototype uses these models to evaluate every equivalent physical query plan and submits the plan with the least cost to the Spark execution engine.
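The abstract does not reproduce the thesis's estimation code, but the equi-depth approach it describes can be sketched briefly. The following Python sketch builds an equi-depth histogram over a column's values and uses it to estimate the cardinality of a range predicate (`x <= v`), which is the kind of intermediate-result estimate that feeds join ordering. The function names and the uniform-distribution-within-bucket assumption are illustrative, not taken from the thesis.

```python
def build_equi_depth(values, num_buckets):
    """Equi-depth histogram: bucket boundaries are chosen so that every
    bucket holds roughly the same number of rows (the bucket 'depth')."""
    vs = sorted(values)
    n = len(vs)
    depth = n / num_buckets
    # uppers[i] = inclusive upper bound of bucket i
    uppers = [vs[min(int((i + 1) * depth) - 1, n - 1)] for i in range(num_buckets)]
    lowers = [vs[0]] + uppers[:-1]
    return {"lowers": lowers, "uppers": uppers, "depth": depth, "total": n}

def estimate_card_le(hist, v):
    """Estimated number of rows with value <= v, assuming values are
    spread uniformly inside each bucket."""
    card = 0.0
    for lo, hi in zip(hist["lowers"], hist["uppers"]):
        if v >= hi:
            card += hist["depth"]            # whole bucket qualifies
        elif v > lo:
            card += hist["depth"] * (v - lo) / (hi - lo)  # partial bucket
    return card
```

For example, over a uniform column of 100 rows split into 4 buckets, `estimate_card_le(hist, 50)` returns an estimate close to 50 rows; the optimizer would use such estimates to size the output of a filter before costing the joins above it.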
The execution engine then returns the result to the user. Finally, four query tasks were used to test and verify the prototype in two different cluster environments. Comparing the execution times of the non-optimized and optimized plans, the results show that the prototype improved query execution performance by an average of 13.51%.
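The choice between the two equi-join implementations costed by the thesis can be sketched as follows. This Python sketch contrasts an illustrative Shuffle Join cost (both inputs cross the network) with a Broadcast Join cost (only the smaller input is replicated to every executor); the per-byte coefficients and parameter names are assumptions for illustration, not the thesis's actual cost formulas.

```python
def shuffle_join_cost(left_rows, right_rows, row_size,
                      net_cost_per_byte=1.0, io_cost_per_byte=0.5):
    """Shuffle Join: both sides are repartitioned by join key, so every
    row of both inputs is written and sent over the network."""
    shuffled_bytes = (left_rows + right_rows) * row_size
    return shuffled_bytes * (net_cost_per_byte + io_cost_per_byte)

def broadcast_join_cost(left_rows, right_rows, row_size,
                        num_executors=8, net_cost_per_byte=1.0):
    """Broadcast Join: the smaller side is copied to every executor,
    while the larger side stays in place and is never shuffled."""
    small_bytes = min(left_rows, right_rows) * row_size
    return small_bytes * num_executors * net_cost_per_byte

def choose_join(left_rows, right_rows, row_size):
    """Pick the physical join operator with the lower estimated cost,
    as a cost-based optimizer would."""
    s = shuffle_join_cost(left_rows, right_rows, row_size)
    b = broadcast_join_cost(left_rows, right_rows, row_size)
    return ("broadcast", b) if b < s else ("shuffle", s)
```

Under this model, joining a 10-million-row table with a 1,000-row table picks Broadcast Join (replicating the tiny side is cheap), while joining two large tables picks Shuffle Join, matching the intuition behind costing the two operators separately.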
Keywords/Search Tags: Query optimization, cost model, Spark, Database