
The Query Execution Optimization In Spark SQL

Posted on: 2019-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Wan
Full Text: PDF
GTID: 2428330596460905
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the scale of data generated and processed by governments, enterprises and research institutions has reached terabytes or even petabytes per day. Although Hadoop provides reliable storage and processing of big data across multiple machines, the framework has certain problems, such as storing intermediate data in the HDFS file system, which incurs heavy random disk I/O. The Spark technique emerged to meet this need. Spark is a distributed in-memory computing framework that replaces Hadoop's MapReduce model with a faster DAG (Directed Acyclic Graph) workflow. To reduce the frequency of shuffles, Spark allows more data to be read and written in memory. However, the shuffle phase in Spark still reads and writes intermediate data on disk, and the Spark SQL workflow can also read and write redundant data. Targeting these issues, this thesis investigates query execution optimization in Spark SQL.

This thesis gives an in-depth analysis of Spark SQL combined with the features of SQL queries. We add an intermediate-data cache layer between the underlying persistent file system and the upper Spark core to mitigate random disk I/O. Using a query pre-analysis module, SSO dynamically adjusts the size of the cache layer for different queries. In addition, a histogram method is proposed to estimate the shuffle size of join operations. Finally, an allocation module assigns an appropriate memory size to each node in the cluster. We also provide a combination algorithm based on a cost model; this method weighs the benefits and costs of sharing data to determine whether to merge related jobs. Together, these techniques make efficient use of cluster resources and speed up the execution of Spark SQL query tasks.

We designed and implemented the SSO (Spark SQL Optimizer) system to realize the functions above, and compared its query performance with Spark SQL on the TPC-H benchmark. The experimental results demonstrate that SSO has significant advantages in improving query speed, reducing redundant I/O cost, and decreasing memory usage.
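The abstract does not show SSO's actual code, but the two estimation ideas it names can be sketched in plain Python. The sketch below is illustrative only: it shows (1) a classic histogram-based estimate of an equi-join's output cardinality, and (2) a cost-model check that merges two jobs only when the I/O saved by sharing intermediate data outweighs the merge overhead. All function names, row counts, and cost constants are assumptions, not the thesis's implementation.

```python
# Hedged sketch of two ideas from the abstract, not the SSO system itself:
# (1) histogram-based equi-join size estimation, and
# (2) a cost-model decision on whether to merge two related jobs.
from collections import Counter


def key_histogram(rows, key):
    """Histogram of join-key frequencies: key value -> number of rows."""
    return Counter(r[key] for r in rows)


def estimate_join_rows(hist_a, hist_b):
    """Classic per-key estimate of equi-join output cardinality:
    each left row with key k pairs with hist_b[k] right rows."""
    return sum(n * hist_b.get(k, 0) for k, n in hist_a.items())


def should_merge(shared_bytes, read_cost_per_byte, merge_overhead):
    """Merge two jobs iff the cost saved by reading the shared
    intermediate data once, instead of twice, exceeds the overhead
    of combining them into a single job (all constants assumed)."""
    saving = shared_bytes * read_cost_per_byte  # the avoided second read
    return saving > merge_overhead


orders = [{"cust": 1}, {"cust": 1}, {"cust": 2}, {"cust": 3}]
customers = [{"cust": 1}, {"cust": 2}]
ha = key_histogram(orders, "cust")
hb = key_histogram(customers, "cust")
print(estimate_join_rows(ha, hb))           # 2*1 + 1*1 = 3
print(should_merge(10 * 2**20, 1e-6, 5.0))  # saving ~10.49 > 5.0 -> True
```

In a real optimizer the histograms would be built from table statistics rather than from the rows themselves, and the cost model would also account for recomputation and caching costs; this sketch only shows the shape of the decision.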
Keywords/Search Tags:Spark, Spark SQL, intermediate data caching, cost-based optimization