
The Query Execution Optimization In Spark SQL

Posted on: 2019-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Wan
Full Text: PDF
GTID: 2428330596460905
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the scale of data generated and processed by governments, enterprises and research institutions has reached terabytes or even petabytes per day. Although Hadoop provides reliable storage and processing of big data across multiple machines, the framework has certain problems, such as storing intermediate data in the HDFS file system, which incurs heavy random disk I/O. The Spark technique emerged to meet this need. Spark is a distributed in-memory computing framework that replaces Hadoop's MapReduce model with a faster DAG (Directed Acyclic Graph) workflow. To reduce the frequency of shuffles, Spark allows more data to be read and written in memory. However, the shuffle phase in Spark still reads and writes intermediate data on disk, and the Spark SQL workflow can also read and write redundant data. Targeting these issues, this thesis investigates query execution optimization in Spark SQL.

This thesis gives an in-depth analysis of Spark SQL combined with the features of SQL queries. We add an intermediate-data cache layer between the underlying persistent file system and the upper Spark core to mitigate random disk I/O. Using a query pre-analysis module, SSO dynamically adjusts the size of the cache layer for different queries. In addition, a histogram method is proposed to estimate the shuffle size of join operations. Finally, an allocation module assigns an appropriate memory size to each node in the cluster. We also provide a combination algorithm based on a cost model; this method weighs the benefits and costs of sharing data to determine whether to merge related jobs. Together, these techniques make efficient use of cluster resources and speed up the execution of Spark SQL query tasks.

We designed and implemented the SSO (Spark SQL Optimizer) system to realize the functions above, and compared its query performance with Spark SQL on the TPC-H benchmark. The experimental results demonstrate that SSO has significant advantages in improving query speed, reducing redundant I/O cost, and decreasing memory usage.
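The abstract does not show SSO's actual code, but the two estimation ideas it names can be sketched in plain Python. The sketch below is illustrative only: it shows (1) a classic histogram-based estimate of an equi-join's output cardinality, and (2) a cost-model check that merges two jobs only when the I/O saved by sharing intermediate data outweighs the merge overhead. All function names, row counts, and cost constants are assumptions, not the thesis's implementation.

```python
# Hedged sketch of two ideas from the abstract, not the SSO system itself:
# (1) histogram-based equi-join size estimation, and
# (2) a cost-model decision on whether to merge two related jobs.
from collections import Counter


def key_histogram(rows, key):
    """Histogram of join-key frequencies: key value -> number of rows."""
    return Counter(r[key] for r in rows)


def estimate_join_rows(hist_a, hist_b):
    """Classic per-key estimate of equi-join output cardinality:
    each left row with key k pairs with hist_b[k] right rows."""
    return sum(n * hist_b.get(k, 0) for k, n in hist_a.items())


def should_merge(shared_bytes, read_cost_per_byte, merge_overhead):
    """Merge two jobs iff the cost saved by reading the shared
    intermediate data once, instead of twice, exceeds the overhead
    of combining them into a single job (all constants assumed)."""
    saving = shared_bytes * read_cost_per_byte  # the avoided second read
    return saving > merge_overhead


orders = [{"cust": 1}, {"cust": 1}, {"cust": 2}, {"cust": 3}]
customers = [{"cust": 1}, {"cust": 2}]
ha = key_histogram(orders, "cust")
hb = key_histogram(customers, "cust")
print(estimate_join_rows(ha, hb))           # 2*1 + 1*1 = 3
print(should_merge(10 * 2**20, 1e-6, 5.0))  # saving ~10.49 > 5.0 -> True
```

In a real optimizer the histograms would be built from table statistics rather than from the rows themselves, and the cost model would also account for recomputation and caching costs; this sketch only shows the shape of the decision.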
Keywords/Search Tags:Spark, Spark SQL, intermediate data caching, cost-based optimization