Research Of Spark SQL Query Optimization Based On Runtime Statistics Collecting

Posted on: 2021-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: C F Liu
GTID: 2428330623467822
Subject: Computer Science and Technology
Abstract/Summary:
In the age of big data, companies that run their business on the Internet generate huge amounts of data. By analyzing these data, companies can plan their operations better and increase profits. The emergence of the Hadoop MapReduce system significantly simplified big data analysis, and it has been widely adopted as a business data analysis tool. However, Hadoop MapReduce can be very slow on data analysis tasks because it stores intermediate results on disk. The distributed compute engine Spark stores intermediate results in main memory instead, which makes it much faster than Hadoop MapReduce. To further simplify data analysis, researchers developed the Hive data warehouse system and Spark SQL. These systems express data analysis tasks as SQL queries; compared with expressing the tasks in code, this allows query optimization techniques to be applied to the computing task.

However, query optimization in Spark SQL still has some problems: 1) It requires operators to collect data statistics manually by executing SQL commands, and no optimization is performed when statistics are absent. Operators, however, usually do not know how to collect statistics effectively and are unfamiliar with query optimization theory. 2) The collected statistics are not accurate enough for deep optimization.

To solve these problems, we propose a runtime optimizer for Spark SQL that collects statistics while the query is executing and dynamically adjusts the execution plan according to the collected statistics. The optimizer has three main components: 1) The BFP (Bloom Filter Prune) join algorithm. We use a Bloom filter to prune join input rows that do not satisfy the join predicate before performing the join. Depending on the pruning method, pruning is either single-sided or double-sided. 2) Statistics collection. We use the AMS sketch and the Bloom filter to collect more accurate statistics by estimating the intermediate join results. 3) A graph-based join planning algorithm. It schedules the statistics collection tasks according to the query to be executed and adaptively adjusts the query execution plan according to the statistics.

Finally, we evaluated our runtime query optimization algorithm experimentally. The experiments show that, when the join order is not considered, our join algorithm performs 12% better than Spark SQL by pruning the join data first, and when it fails to prune any data, the extra time cost is less than 7%. Even without upfront statistics, our planning algorithm generates the best join order in 14 of the 18 tests; the best case reduces query time by 31%, and the cost of statistics collection is less than 5%.
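
The idea behind Bloom-filter pruning before a join can be illustrated with a small, single-machine sketch. This is not the thesis's implementation and does not use Spark: the class and function names (BloomFilter, build_filter, prune, hash_join, bfp_join), the filter size, and the hashing scheme are all assumptions chosen for illustration. Rows are plain tuples whose first element is the join key.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over join keys (illustrative, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def build_filter(rows, key_index=0):
    """Build a Bloom filter from the join keys of one input."""
    bf = BloomFilter()
    for row in rows:
        bf.add(row[key_index])
    return bf


def prune(rows, bf, key_index=0):
    """Drop rows whose key is definitely absent from the other input."""
    return [row for row in rows if bf.might_contain(row[key_index])]


def hash_join(left, right, key_index=0):
    """Plain in-memory equi-join used as the final join step."""
    table = {}
    for row in left:
        table.setdefault(row[key_index], []).append(row)
    return [l + r for r in right for l in table.get(r[key_index], [])]


def bfp_join(left, right, double_sided=False):
    """Prune one input (or both) with Bloom filters, then join."""
    # Single-sided: filter the right input using the left input's keys.
    right_pruned = prune(right, build_filter(left))
    # Double-sided: additionally filter the left input using the
    # keys that survived on the right.
    if double_sided:
        left_pruned = prune(left, build_filter(right_pruned))
    else:
        left_pruned = left
    return hash_join(left_pruned, right_pruned)
```

Because a Bloom filter never yields false negatives, pruning cannot drop a row that would have matched, so bfp_join returns the same rows as hash_join on the unpruned inputs. False positives only mean that some non-matching rows survive pruning and are discarded by the join itself.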
Keywords/Search Tags: Query Optimization, Spark SQL, Runtime, Bloom Filter, Sketch