
Key Technologies Of Big Data Approximate Computing Based On Synopses

Posted on: 2022-08-08 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: M F Zhang | Full Text: PDF
GTID: 1488306569984249 | Subject: Computer software and theory
Abstract/Summary:
In the era of big data, data volumes are expanding rapidly, which poses great challenges to big data computing, query processing, and analysis. Traditional data computing techniques struggle to handle the ever-increasing amount of data. To this end, researchers have made trade-offs among accuracy, storage cost, and efficiency: they build data synopses from big data and use these synopses to solve approximate computing problems. Data synopses are summarized information extracted from big data, including samples, sketches, histograms, data cubes, wavelets, and so on. Approximate computing based on data synopses greatly reduces the amount of data to be processed, effectively improves efficiency, and meets the need for rapid responses. Generally, the more information a synopsis carries, the more accurate the final estimates; that is, selecting more samples or storing more histogram buckets yields more accurate results, which means these methods buy accuracy at the expense of increased space cost. To cope with today's ever-expanding data volumes, we aim to improve accuracy while keeping the data synopses lightweight. To this end, we propose new data synopses and methods for four important problems in big data computing: frequency estimation, selectivity estimation, approximate aggregate query processing, and approximate group-by query processing. The proposed methods are more accurate than methods based on traditional data synopses while occupying less space. The main research contents of this paper are summarized as follows.

1. We investigate frequency estimation for big data. Frequency estimation is widely used in fields such as network traffic analysis, network monitoring, anomaly detection, and frequent-item mining. The sheer amount of data makes it costly to store and query true frequencies, whereas frequency estimation algorithms based on sub-linear sketches can greatly reduce the storage cost. The traditional count-min sketch maps data into a sub-linear space through hash functions and therefore suffers large estimation errors caused by hash collisions. To this end, this paper proposes two learned sketches that build frequency models from historical data and separate high-frequency items from low-frequency items. The proposed methods use models to replace the storage of high-frequency items and their frequencies, and use a traditional count-min sketch to process the low-frequency items, thereby avoiding the large errors caused by collisions between high-frequency and low-frequency items. At the same time, replacing the storage of a large number of item frequencies with a lightweight model effectively reduces the space cost. Experimental results on real-life and synthetic datasets verify that the two proposed learned sketches provide more accurate frequency estimates at a lower storage cost than traditional sketches.
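The routing idea shared by the two learned sketches can be illustrated with a small Python sketch. This is only a minimal illustration under simplifying assumptions: the learned frequency model is stood in for by a plain set of items predicted to be heavy hitters, and the class names (CountMinSketch, LearnedSketch) and parameters are hypothetical rather than the dissertation's actual implementation.

```python
# Route items predicted to be high-frequency around the count-min sketch so that
# only low-frequency traffic is absorbed by the sketch and collisions between
# heavy and light items are avoided.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

class LearnedSketch:
    def __init__(self, predicted_heavy_items, width=2048, depth=4):
        # predicted_heavy_items: items that a model trained on historical data
        # flags as high-frequency (the "learned" part, simplified to a set here).
        self.heavy = {item: 0 for item in predicted_heavy_items}
        self.cms = CountMinSketch(width, depth)

    def update(self, item, count=1):
        if item in self.heavy:
            self.heavy[item] += count      # heavy items tracked separately
        else:
            self.cms.update(item, count)   # everything else goes to the sketch

    def query(self, item):
        return self.heavy[item] if item in self.heavy else self.cms.query(item)
```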
2. We investigate multidimensional selectivity estimation for big data. Query selectivity is the proportion of tuples satisfying the query predicates among all tuples, and the accuracy of selectivity estimation largely determines the quality of query optimization. Database systems usually estimate selectivity with one-dimensional histograms, but estimating the selectivity of a multidimensional range query under the attribute-independence assumption is usually inaccurate. Multidimensional histograms address this problem, yet they face challenges of their own, including construction difficulty and large space cost. The possible bucketings of equi-depth and V-optimal histograms multiply as the dimensionality increases, making it difficult to determine the optimal bucketing; equi-width histograms are easy to bucket but handle skewed data distributions poorly. Existing histograms usually improve accuracy by increasing the number of buckets, and as the dimensionality grows this leads to additional space cost. To improve accuracy without increasing the storage cost of the histogram, this paper proposes a method that learns a density model from the many buckets in dense regions and uses it to replace the storage of those buckets; the density model is then combined with the remaining histogram buckets to estimate selectivity. Experiments on real and synthetic datasets verify that the proposed algorithm outperforms a variety of representative multidimensional histograms at the same storage cost.
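The combination of a density model with explicit buckets can be sketched as follows. This is a minimal sketch under assumptions not stated in the abstract: the density model is approximated by a kernel density estimate, the sparse region is summarized by a few equi-width buckets with a uniform-distribution assumption inside each bucket, and all names and the bucket layout are illustrative.

```python
# Estimate the selectivity of a multidimensional range query by combining a learned
# density model over the dense region with a handful of explicit histogram buckets
# covering the sparse region.
import numpy as np
from scipy.stats import gaussian_kde

def bucket_overlap_fraction(bucket_low, bucket_high, q_low, q_high):
    """Fraction of a bucket's volume inside the query box (uniform assumption)."""
    frac = 1.0
    for bl, bh, ql, qh in zip(bucket_low, bucket_high, q_low, q_high):
        overlap = max(0.0, min(bh, qh) - max(bl, ql))
        frac *= overlap / (bh - bl)
    return frac

def estimate_selectivity(kde, n_dense, buckets, n_total, q_low, q_high):
    # Probability mass the density model assigns to the query box, scaled by the
    # number of tuples the model summarizes.
    dense_part = n_dense * kde.integrate_box(np.array(q_low), np.array(q_high))
    # Remaining tuples live in explicit (low, high, count) buckets.
    sparse_part = sum(cnt * bucket_overlap_fraction(lo, hi, q_low, q_high)
                      for lo, hi, cnt in buckets)
    return (dense_part + sparse_part) / n_total

# Usage with toy data: one dense Gaussian blob modeled by the KDE plus one sparse bucket.
rng = np.random.default_rng(0)
dense_points = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))
kde = gaussian_kde(dense_points.T)                  # density model over the dense region
buckets = [((5.0, 5.0), (10.0, 10.0), 200)]         # one sparse bucket holding 200 tuples
sel = estimate_selectivity(kde, n_dense=5000, buckets=buckets, n_total=5200,
                           q_low=(-1.0, -1.0), q_high=(1.0, 1.0))
print(f"estimated selectivity: {sel:.3f}")
```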
3. We investigate approximate aggregate query processing for big data. Computing exact aggregate results over big data is expensive and makes it hard to meet fast-response requirements, so approximate aggregate query processing has emerged to provide an approximate answer efficiently. Sampling-based methods estimate the result on the full data from the result on a sample; their accuracy depends on the sample size, and a small sample cannot represent the overall distribution, especially for skewed data. Pre-aggregation-based methods estimate new queries from pre-computed aggregations, but if the range of a new query does not intersect the pre-computed ones, those aggregations are of little help. Machine-learning-based methods train regression models on historical queries and use them to estimate results, but they provide no error bounds. In this paper, we propose a new method that combines sampling, pre-aggregations, and machine learning models, exploiting the advantages of each. It trains an error model on the pre-aggregations and a small sample, and selects the pre-computed query with the smallest predicted error as the baseline for answering the current query (a sketch of this baseline-selection step appears after item 4 below). Experiments on real and synthetic datasets verify the superiority of the proposed algorithm over methods based purely on sampling, pre-aggregations, or machine learning models.

4. We investigate approximate group-by query processing for big data. Sampling is one of the main approaches to approximate query processing, but sampling-based approximate query processing handles group-by queries poorly: a group-by result contains multiple values, and the data distributions of different groups differ. Uniform sampling struggles to obtain enough samples for rare groups; stratified sampling improves accuracy, but stratified samples built for one query may not apply to others; online sampling can draw samples for a given query, but sampling at query time increases the response latency. To this end, we propose a sample generation method based on a conditional generative model, which obtains samples without accessing the raw data and can generate any number of targeted samples for a given group. It can be flexibly combined with stratified sampling to improve accuracy while reducing response latency. In addition, this paper proposes a sample generation method for group-by queries with low selectivities, which generates high-quality stratified samples for given predicates. Experiments on real and synthetic datasets verify the high efficiency and accuracy of the proposed algorithms.
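To make the baseline-selection step of item 3 concrete, here is a minimal Python sketch. It assumes the error model is any regressor exposing a scikit-learn-style predict() method, that pre-computed queries are stored as (feature vector, predicate, answer) triples, and that the aggregate is a SUM; the feature encoding and the sample-based correction rule are illustrative assumptions, not the dissertation's exact formulation.

```python
# Pick the pre-computed aggregate whose predicted estimation error for the current
# query is smallest, then correct its answer with a small uniform sample.
import numpy as np

def choose_baseline(query_vec, precomputed, error_model):
    # precomputed: list of (pre_vec, pre_predicate, pre_answer) triples.
    feats = np.array([np.concatenate([query_vec, pre_vec])
                      for pre_vec, _, _ in precomputed])
    predicted_errors = error_model.predict(feats)      # error model trained offline
    return precomputed[int(np.argmin(predicted_errors))]

def approximate_sum(query_vec, query_pred, value_of, precomputed, error_model,
                    sample, scale):
    """Estimate SUM(value) for the current query as the answer of the best
    pre-computed baseline, plus a correction for rows covered by one predicate but
    not the other, estimated from the sample (scale = 1 / sampling_rate)."""
    pre_vec, pre_pred, pre_answer = choose_baseline(query_vec, precomputed, error_model)
    correction = scale * sum(
        value_of(row) * (int(query_pred(row)) - int(pre_pred(row))) for row in sample
    )
    return pre_answer + correction
```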
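The conditional sample generation of item 4 can likewise be sketched in a few lines of Python. Here a per-group Gaussian stands in for the conditional generative model, and the class and function names are illustrative assumptions; the point is only that, once the generator is trained offline, any number of samples for a given group can be produced at query time without touching the raw data.

```python
# Toy conditional generator: given a group key, produce synthetic rows for that
# group, so even rare groups get enough samples for a group-by estimate.
import numpy as np

class ConditionalGenerator:
    def __init__(self, group_stats, seed=0):
        # group_stats: {group_key: (mean, std)} learned offline from the raw data.
        self.group_stats = group_stats
        self.rng = np.random.default_rng(seed)

    def sample(self, group_key, n):
        """Generate n synthetic measure values conditioned on the group key."""
        mean, std = self.group_stats[group_key]
        return self.rng.normal(mean, std, size=n)

def approximate_group_by_avg(generator, groups, samples_per_group=1000):
    """Answer 'SELECT group, AVG(measure) ... GROUP BY group' from generated samples
    only, drawing the same number of samples even for rare groups."""
    return {g: float(np.mean(generator.sample(g, samples_per_group))) for g in groups}

# Usage with toy per-group statistics (assumed to have been learned offline).
gen = ConditionalGenerator({"rare_group": (42.0, 3.0), "common_group": (10.0, 5.0)})
print(approximate_group_by_avg(gen, ["rare_group", "common_group"]))
```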
Keywords/Search Tags:Approximate Query Processing, Frequency Estimation, Selectivity Estimation, Aggregate Queries, Group-By Queries