
Research On Cost-based Query Optimization For Spark SQL

Posted on: 2017-03-16    Degree: Master    Type: Thesis
Country: China    Candidate: C. L. Liu    Full Text: PDF
GTID: 2308330485485026    Subject: Computer application technology
Abstract/Summary:
As a next-generation general-purpose big data computing platform, Spark attracts a growing number of researchers and enterprises with its ability to process data quickly. Spark SQL, Spark's structured-data analysis component, is used by more and more organizations and enterprises to analyze their structured data and mine valuable information from it. However, compared with traditional database systems and MapReduce-based query engines, Spark SQL lacks principled cost-based query optimization. Drawing on techniques from database systems and MapReduce-based query optimization, this thesis therefore proposes a cost-based query optimization prototype tailored to Spark that improves the efficiency of query processing.

This thesis focuses on building such a prototype. Unlike MapReduce, Spark can cache intermediate results in memory to reduce disk I/O cost. The thesis defines a cost model for each of Spark's physical operators corresponding to the common relational operators, i.e., select, where, and group-by. For Spark's two implementations of equi-join, Shuffle Join and Broadcast Join, separate cost models are defined.

The thesis surveys query optimization in database systems and MapReduce-based query engines, with an in-depth study of cost-based query optimization techniques. To estimate the cost of a query plan more accurately, equi-depth histograms are adopted: the sizes of intermediate results are estimated from these histograms. The prototype uses the intermediate-result estimates to determine the join order of a query, and the cost model of each physical operator builds on the intermediate-result estimates and the characteristics of Spark. The prototype uses these models to evaluate every equivalent physical query plan and submits the plan with the least cost to the Spark execution engine.
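The abstract does not reproduce the thesis's estimation code, but the equi-depth approach it describes can be sketched briefly. The following Python sketch builds an equi-depth histogram over a column's values and uses it to estimate the cardinality of a range predicate (`x <= v`), which is the kind of intermediate-result estimate that feeds join ordering. The function names and the uniform-distribution-within-bucket assumption are illustrative, not taken from the thesis.

```python
def build_equi_depth(values, num_buckets):
    """Equi-depth histogram: bucket boundaries are chosen so that every
    bucket holds roughly the same number of rows (the bucket 'depth')."""
    vs = sorted(values)
    n = len(vs)
    depth = n / num_buckets
    # uppers[i] = inclusive upper bound of bucket i
    uppers = [vs[min(int((i + 1) * depth) - 1, n - 1)] for i in range(num_buckets)]
    lowers = [vs[0]] + uppers[:-1]
    return {"lowers": lowers, "uppers": uppers, "depth": depth, "total": n}

def estimate_card_le(hist, v):
    """Estimated number of rows with value <= v, assuming values are
    spread uniformly inside each bucket."""
    card = 0.0
    for lo, hi in zip(hist["lowers"], hist["uppers"]):
        if v >= hi:
            card += hist["depth"]            # whole bucket qualifies
        elif v > lo:
            card += hist["depth"] * (v - lo) / (hi - lo)  # partial bucket
    return card
```

For example, over a uniform column of 100 rows split into 4 buckets, `estimate_card_le(hist, 50)` returns an estimate close to 50 rows; the optimizer would use such estimates to size the output of a filter before costing the joins above it.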
The execution engine then returns the result to the user. Finally, four query tasks were used to test and verify the prototype in two different cluster environments. Comparing the execution times of the non-optimized and optimized plans, the results show that the prototype improved query execution performance by an average of 13.51%.
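The choice between the two equi-join implementations costed by the thesis can be sketched as follows. This Python sketch contrasts an illustrative Shuffle Join cost (both inputs cross the network) with a Broadcast Join cost (only the smaller input is replicated to every executor); the per-byte coefficients and parameter names are assumptions for illustration, not the thesis's actual cost formulas.

```python
def shuffle_join_cost(left_rows, right_rows, row_size,
                      net_cost_per_byte=1.0, io_cost_per_byte=0.5):
    """Shuffle Join: both sides are repartitioned by join key, so every
    row of both inputs is written and sent over the network."""
    shuffled_bytes = (left_rows + right_rows) * row_size
    return shuffled_bytes * (net_cost_per_byte + io_cost_per_byte)

def broadcast_join_cost(left_rows, right_rows, row_size,
                        num_executors=8, net_cost_per_byte=1.0):
    """Broadcast Join: the smaller side is copied to every executor,
    while the larger side stays in place and is never shuffled."""
    small_bytes = min(left_rows, right_rows) * row_size
    return small_bytes * num_executors * net_cost_per_byte

def choose_join(left_rows, right_rows, row_size):
    """Pick the physical join operator with the lower estimated cost,
    as a cost-based optimizer would."""
    s = shuffle_join_cost(left_rows, right_rows, row_size)
    b = broadcast_join_cost(left_rows, right_rows, row_size)
    return ("broadcast", b) if b < s else ("shuffle", s)
```

Under this model, joining a 10-million-row table with a 1,000-row table picks Broadcast Join (replicating the tiny side is cheap), while joining two large tables picks Shuffle Join, matching the intuition behind costing the two operators separately.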
Keywords/Search Tags: Query optimization, cost model, Spark, Database