
Research On Online Aggregation Query Optimization Based On Spark

Posted on: 2019-10-08; Degree: Master; Type: Thesis
Country: China; Candidate: H Gong; Full Text: PDF
GTID: 2428330596460868; Subject: Computer Science and Technology
Abstract/Summary:
With the spread of social networks, the Internet of Things, and e-commerce, data volumes have grown explosively. Online aggregation has been introduced into big-data processing platforms because it obtains approximate answers through sampling, avoids scanning the entire table, and improves the execution efficiency of aggregation operations in the database. However, existing research focuses only on implementing online aggregation systems on Spark and does not consider optimization. As a result, skewed data is processed inefficiently, and the join operation becomes a bottleneck in Spark-based online aggregation systems. To solve these problems and improve the overall execution performance of online aggregation on Spark, this thesis studies the following issues.

For skewed data processing in online aggregation, this thesis jointly considers how frequently each attribute column appears in historical queries, its degree of skew, the storage overhead of the stratified samples, and other conditions to build a mixed integer linear programming model that selects suitable attribute columns on which to build stratified samples. We design a single-table query algorithm based on stratified samples and give an interval estimation formula for stratified sampling, which effectively improves the efficiency of online aggregation queries over skewed data.

For multi-table joins in online aggregation, this thesis uses indexes to reduce the number of samples required. First, considering the frequency of the join attributes, the storage overhead of the indexes, and other conditions, an optimization model is established to select suitable columns to index. Based on the index, we design the Index Ripple Join algorithm for two tables and give its interval estimation formula, under which the estimate is unbiased. We then extend the two-table join to multi-table joins. According to the join conditions and indexes, a multi-table join is abstracted as a join graph, and a join connecting tree is generated from the join graph to produce a multi-table join execution plan.

For nested queries in online aggregation, this thesis implements the nested query algorithm from G-OLA and combines it with the skewed data processing and multi-table query optimizations to improve nested query efficiency.

This thesis designs and develops SOLA, an online aggregation prototype system built on the big data platform Spark and combined with Hive, and deploys and tests it on the Dawn cluster. The test results show that, compared with existing algorithms, the proposed models and algorithms significantly reduce the number of samples required and improve execution efficiency, making a positive contribution to the technical development of the online aggregation field.
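The abstract does not reproduce the thesis's interval estimation formula. As an illustration only, the sketch below uses the textbook stratified sampling estimator for a SUM aggregate with a normal-approximation confidence interval. The function name `stratified_sum_estimate`, its parameters, and the in-memory dictionary standing in for a partitioned table are all assumptions made for this example, not part of SOLA.

```python
import math
import random
from statistics import mean, variance


def stratified_sum_estimate(strata, sample_size_per_stratum, z=1.96, seed=0):
    """Estimate SUM over all rows from per-stratum simple random samples.

    strata: dict mapping a stratum key to the list of values in that stratum
            (hypothetical in-memory stand-in for a table split by a skewed column).
    Returns (estimate, half_width); the interval is estimate +/- half_width,
    with z=1.96 giving an approximate 95% confidence level.
    """
    rng = random.Random(seed)
    estimate = 0.0
    var_total = 0.0
    for values in strata.values():
        n_pop = len(values)
        n = min(sample_size_per_stratum, n_pop)
        sample = rng.sample(values, n)
        # Scale the sample mean up to the stratum total.
        estimate += n_pop * mean(sample)
        # A fully scanned stratum contributes no sampling error. A one-row
        # sample has no variance estimate; this sketch treats it as zero.
        if 1 < n < n_pop:
            s2 = variance(sample)  # sample variance within the stratum
            # Variance of the stratum total with finite population correction.
            var_total += n_pop ** 2 * (1 - n / n_pop) * s2 / n
    return estimate, z * math.sqrt(var_total)
```

Because each stratum is sampled separately, a small stratum holding a few very large values is always represented, which is the usual motivation for stratifying on a skewed column rather than sampling the whole table uniformly.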
Keywords/Search Tags: Online Aggregation, Skewed Data, Bootstrap, Multi-table Join, Nested Query