Approximate Query Processing Technology Based On Distribution Perception

Posted on: 2022-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: H Wu
Full Text: PDF
GTID: 2518306479493414
Subject: Software engineering
Abstract/Summary:
In the era of big data, with the rapid growth of data volume and the spread of data-driven decision making, large-scale analytical queries have become increasingly important, but computing exact query results on massive datasets has become prohibitively expensive. Approximate Query Processing (AQP) is a technology that quickly provides approximate answers to SQL queries, trading accuracy for faster response. In online-sampling AQP technologies, random sampling methods such as reservoir and Bernoulli sampling are widely used, but they are usually suited to uniformly distributed datasets and perform poorly on skewed data. Therefore, this thesis proposes a distribution-aware approximation framework for aggregation queries, which uses distribution statistics stored offline to adaptively call different sampling methods to generate samples for the query column sets, so as to answer aggregation queries more efficiently and accurately.

Traditional AQP technologies also have other shortcomings: the query response time of online AQP technologies is relatively long, while offline AQP technologies have relatively high approximation errors and take up a lot of memory space. This thesis therefore further proposes a sampling optimization method that can perceive the distribution of aggregate columns in range queries, and then trains machine learning models on the obtained samples to answer range queries quickly while also providing error guarantees for the query results. The main work and contributions can be summarized as follows:

· An online-sampling AQP technology based on the distribution perception of query column sets for approximating aggregation queries. To address the poor performance of traditional sampling methods when processing aggregation queries on skewed datasets, this thesis proposes an online-sampling approximation framework (Aggregation Queries Approximation, AQapprox) that can perceive the data distribution of query columns. The framework builds an offline Map that records, for each data segment, the statistical information of the attribute sets that users are interested in, and stores the corresponding non-parametric statistical test results. When answering a query, AQapprox uses the Map to adaptively call different sampling methods and set different sampling probabilities for each data segment. Experimental results on both real-world and synthetic datasets show that, compared with the state-of-the-art work, AQapprox achieves a speedup of 5.9 to 64 times when answering queries and has higher approximation accuracy.

· An AQP technology based on the distribution perception of aggregate columns for approximating range queries. Traditional AQP technologies fall short in approximation accuracy and query response time, and some AQP technologies that apply machine learning methods cannot provide error guarantees. To address these shortcomings, this thesis proposes a model-driven approximate processing framework (Range Queries Approximation, RQapprox) for range queries. Based on an analysis of the query workload, this thesis proposes an optimized sampling method that can perceive the distribution of aggregate columns. Density estimators and regression models are then trained on the optimized samples to answer queries quickly, and prediction intervals for the approximate results are provided based on quantile regression. When the original data is updated, D_statistic is used to monitor changes in the data distribution and thus determine whether these models need to be updated. Experimental results on a database benchmark and a real-world dataset show that RQapprox has higher approximation accuracy than the state-of-the-art methods, and that its average speedup over VerdictDB can be up to 13.80 times.
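The abstract names reservoir and Bernoulli sampling as the baseline methods that work well on uniform data but poorly on skewed data. The following is a minimal sketch of those two generic samplers and of how a sample can approximate SUM and AVG. It is not the AQapprox framework: the function names, the sampling rate of 0.01, and the synthetic datasets are assumptions made purely for illustration.

```python
import random


def bernoulli_sample(rows, p, rng):
    """Keep each row independently with probability p."""
    return [row for row in rows if rng.random() < p]


def reservoir_sample(rows, k, rng):
    """Algorithm R: uniform sample of at most k rows from a single pass."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


def approx_sum_bernoulli(rows, p, rng):
    """Estimate SUM by scaling the sample sum by 1/p."""
    return sum(bernoulli_sample(rows, p, rng)) / p


def approx_avg_reservoir(rows, k, rng):
    """Estimate AVG as the mean of a reservoir sample."""
    sample = reservoir_sample(rows, k, rng)
    if not sample:
        raise ValueError("cannot estimate AVG from an empty sample")
    return sum(sample) / len(sample)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical datasets: one roughly uniform, one heavily skewed.
    uniform = [rng.uniform(0, 100) for _ in range(100_000)]
    skewed = [1.0] * 99_990 + [1_000_000.0] * 10
    for name, data in (("uniform", uniform), ("skewed", skewed)):
        exact = sum(data)
        estimate = approx_sum_bernoulli(data, 0.01, rng)
        print(name, "relative error:", abs(estimate - exact) / exact)
```

On the skewed dataset the estimate depends almost entirely on how many of the ten large rows happen to land in the sample, which is the kind of weakness that motivates choosing the sampling method and probability per data segment.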
Keywords/Search Tags: Big Data, Approximate Query Processing, Distribution Perception, Online Sampling, Machine Learning Models, D_statistic