
Research on Approximate Query Processing Technology for Large-Scale Data

Posted on: 2021-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhang
Full Text: PDF
GTID: 2428330647461939
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, the era of big data has arrived, and a wide variety of information collection systems now produce large volumes of data. Enterprises and organizations expect to extract valuable information from this data for business decision making or exploratory analysis. On massive data, however, exact queries are slow and time consuming. In scenarios such as trend prediction and transaction decision making, it is usually enough to know the general trend, and exact results are not required. Approximate query processing can therefore be applied in these scenarios to obtain analysis results quickly at the cost of some query accuracy. This thesis makes the following contributions to approximate query processing:

(1) A method based on statistical summaries and clustering is proposed for range queries on a single column on a distributed platform. The method is extended to group queries, join queries, and subqueries, and a query system is built on HDFS and Spark. First, a clustering algorithm partitions the values of the aggregation column into groups, and the number of elements and the center point of each group are recorded and stored in the file system as summary information. At query time, the query is rewritten against the summary table so that results are obtained quickly. For group queries, a statistical summary is built for the aggregation column of each group, and the final result is computed from the summary information. For join queries, rows that cannot join are first filtered out in Spark using a Bloom filter, and a statistical summary of the non-primary-key join columns is created so that the join query is converted into a single-table query. For subqueries, the summary tables can be used directly to answer the query. Experimental results show that the statistical summary method effectively improves query speed and query accuracy compared with other proposed methods.

(2) A method based on a regression model and the statistical summary is modified and extended to range queries over multiple columns, then further extended to group queries and join queries, and a distributed query system is built on HDFS and Spark. First, Spark reads the data into memory. Samples are then drawn by uniform sampling from uniformly distributed data and by stratified sampling from skewed data. With the range query columns as features and the aggregation column as the predicted value, a regression model is fitted and a density function is estimated, and the range query result is obtained by integration. The lower and upper values of the range serve as the lower and upper bounds of the integral; if the query specifies only one of them, the minimum or maximum value is used for the missing bound. Combining this with the statistical summary method allows the maximum, the minimum, and the total number of samples used by the algorithm to be computed quickly. For skewed data, stratified sampling prevents groups that are too small from yielding no samples in group queries, which would otherwise cause groups to be missing from the query result. For join queries, the joined tables are converted into a single table using the statistical summary method, and the data model is then built on that table. Experimental results show that the model-based method effectively improves query speed and increases query accuracy compared with other proposed methods.
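
To make the statistical summary of method (1) concrete, the following Python sketch clusters one aggregation column, keeps only each group's center and element count, and answers a range query from that summary alone. It is a minimal single-machine illustration under stated assumptions, not the thesis's HDFS/Spark system: the abstract does not name the clustering algorithm, so 1-D k-means is assumed, and the names `build_summary` and `approx_range_query`, as well as the rule that a group contributes when its center lies in the queried range, are hypothetical.

```python
def build_summary(values, k, iters=20):
    """Cluster a column with 1-D k-means (an assumed choice of algorithm).

    Returns a sorted list of (center, count) pairs: the summary table.
    """
    data = sorted(values)
    n = len(data)
    # Deterministic initialisation: k evenly spaced order statistics.
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for x in data:
            # Assign each value to its nearest center.
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            sums[j] += x
            counts[j] += 1
        # Move each center to the mean of its group (empty groups stay put).
        updated = [sums[j] / counts[j] if counts[j] else centers[j]
                   for j in range(k)]
        if updated == centers:
            break
        centers = updated
    # Each stored center is the mean of its group, so center * count
    # reproduces that group's exact sum.
    return sorted((sums[j] / counts[j], counts[j])
                  for j in range(k) if counts[j])


def approx_range_query(summary, lo, hi):
    """Approximate COUNT, SUM and AVG for lo <= value <= hi from the summary."""
    hits = [(c, m) for c, m in summary if lo <= c <= hi]
    count = sum(m for _, m in hits)
    total = sum(c * m for c, m in hits)
    return {"count": count, "sum": total,
            "avg": total / count if count else None}


column = [1, 2, 3, 10, 11, 12, 100, 101, 102]
summary = build_summary(column, k=3)
result = approx_range_query(summary, 0, 11)
```

In this sketch the query touches only the `k` summary rows, never the raw data, which is where the speed-up comes from. The price is accuracy at range boundaries: a group is counted in full or not at all, so a range that cuts through a group is over- or under-counted.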
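
The model-based idea of method (2) can be sketched in the same spirit. The abstract does not specify the regression model, the density estimator, or the exact integrals, so everything below is an assumption made for illustration: a least-squares line stands in for the regression model R(x), a histogram for the density function D(x), a midpoint rule for the integration, and the aggregates are taken as COUNT ≈ N·∫D(x)dx, SUM ≈ N·∫D(x)R(x)dx and AVG as their ratio, with N the table size. The sketch handles one range column on one machine, and all function names are hypothetical.

```python
def fit_linear(xs, ys):
    """Least-squares line y = slope * x + intercept (needs 2+ distinct xs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_histogram_density(xs, bins=20):
    """Histogram density estimate; also returns the sample min and max."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    n = len(xs)

    def density(x):
        if x < lo or x > hi:
            return 0.0
        return counts[min(int((x - lo) / width), bins - 1)] / (n * width)

    return density, lo, hi


def integrate(f, a, b, steps=2000):
    """Midpoint-rule numerical integration of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def approx_aggregates(sample_x, sample_y, total_rows, lower=None, upper=None):
    """Approximate COUNT/SUM/AVG of y for lower <= x <= upper from a sample."""
    regression = fit_linear(sample_x, sample_y)
    density, xmin, xmax = fit_histogram_density(sample_x)
    # A missing bound falls back to the minimum or maximum value.
    a = xmin if lower is None else max(lower, xmin)
    b = xmax if upper is None else min(upper, xmax)
    if a >= b:
        return {"count": 0.0, "sum": 0.0, "avg": None}
    mass = integrate(density, a, b)
    weighted = integrate(lambda x: density(x) * regression(x), a, b)
    return {"count": total_rows * mass,
            "sum": total_rows * weighted,
            "avg": weighted / mass if mass else None}
```

Once the two models are fitted, a query needs only the integration bounds, so its cost does not depend on the table size. For skewed data the sample passed in would come from stratified rather than uniform sampling, as the abstract describes; that step is not shown here.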
Keywords/Search Tags: Approximate query processing, distributed system, regression model, statistical summary, clustering, join query