Font Size: a A A

Research On Approximate Aggregate Query Processing Over Low Usability Data

Posted on:2020-12-06Degree:DoctorType:Dissertation
Country:ChinaCandidate:A Z ZhangFull Text:PDF
GTID:1368330590472973Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development of information technology,dirty data comes along when data size grows gradually.Dirty data leads to the low usability of the big data.Low usability data refers to data with errors cannot be repaired.Approximate computing on dirty data(such as query,analysis,mining)is an important problem.Approximate computing on dirty data is different from traditional approximate computing.It gives solutions which satisfy the required accuracy on the dirty data,which is inconsistent,incomplete,inaccurate,incurrent or duplicate.At present,query processing over dirty data mainly includes the two following methods.First,we repair the dirty data and then evaluate the query over the repaired data.Second,we directly evaluate the query over the dirty data and return the results which satisfy all possible repair plans.In the first approach,there exists no repair approach that can guarantees the accuracy of the query result.In the second approach,it may result in available information lost.Hence,this paper studies the problem of query processing over dirty data.We focus on aggregate query processing over incomplete data,inconsistent data and duplicate data.The main work of this paper is in the following.Firstly,we study the problem of aggregate query processing over applicable incomplete data.Incomplete data is also called missing data.Since imputation methods have no guarantees on the accuracy of the query results,this paper proposes an interval estimation method for aggregate query over incomplete data.Suppose the aggregate attribute domain is known and the missing data can be imputed.We directly estimate the query result rather than impute the missing data.We estimate the upper bound ub and lower bound lb of the aggregate query results among all possible worlds,and give the interval estimation of the query result.The ground-truth query result is guaranteed to be among the interval [lb,ub].Our main techniques are parameter-free and do not assume prior knowledge about the distribution and missingness mechanisms.If the domain of the aggregate attribute is known,we can provide linear-time algorithms for SUM,COUNT and AVG queries.Secondly,we study the problem of aggregate query processing over inapplicable incomplete data.Suppose the aggregate attribute domain is unknown and some of the missing data can be imputed others cannot.We give the interval estimation of the aggregate query result.This paper extends the relational model under denotational sematic,which can cover all types of incomplete data.We define a new semantic of aggregate query answers over incomplete data,reliable answers.Reliable answers are interval estimations of the ground-truth query results,which can cover the ground-truth results with high probability.For SUM,COUNT,AVG queries,we propose linear approximate evaluation algorithms to compute reliable answers.Thirdly,we study the problem of aggregate query processing over entity-conflict data.Entity resolution refers to the process of identifying the duplicate data.Since entity resolution algorithms are inaccurate and inefficient,we propose CrowdOLA,a novel framework for aggregate query processing on entity-conflict data.Instead of cleaning the whole dataset,CrowdOLA directly estimate the query result on entity-conflict data.CrowdOLA retrieves samples continuously from the dataset,and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion.An unbiased estimator is provided to address the error bias that is introduced by the duplication.We improve the query efficiency with the guarantee on confidence level.Finally,we study the problem of aggregate query processing on inconsistent data.At present,a famous solution on this problem is consistent query answers,which return the query result that satisfies all possible reparations.However,consistent query answers may lose a lot of useful information,and the computation complexity is too high.This paper design a new method based on minimum spanning tree on uncertain graphs to compute the reparation and query result with maximum deterministic probability.The reparation is with lowest cost and highest probability.Therefore,the query result is guaranteed to be with highest accuracy.
Keywords/Search Tags:data quality, approximate query processing, data reparation, result estimation, data usability
PDF Full Text Request
Related items