
Research On Approximate Query Processing Over Inconsistent Data

Posted on: 2018-08-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Liu
Full Text: PDF
GTID: 1368330566498394
Subject: Computer software and theory
Abstract/Summary:
Data quality is the foundation, premise, and guarantee of the validity and accuracy of data analysis results. Among the numerous factors that cause data quality problems, inconsistency is a dominant one: a relational database is inconsistent if it does not satisfy a given set of integrity constraints. With the spread of the Internet, applications obtain data from many different sources, which further intensifies the spread of inconsistent data. Inconsistent data poses a serious challenge to query processing and makes query results inaccurate. Existing work on processing inconsistent data mainly follows two approaches. The first repairs the inconsistent data and then executes queries over the repaired data, but one can hardly know which repair is the correct one. The second executes queries over the inconsistent data directly but returns only the consistent answers, i.e., those contained in the answer set of every possible repair; dropping the uncertain answers, however, causes heavy information loss. In this dissertation, we first repair the data with absolutely accurate values obtained from master data or domain experts, which yields what we call inconsistent weakly available data, and then study query processing over this data. For brevity, "inconsistent data" denotes inconsistent weakly available data in the rest of this abstract. The main research work is outlined as follows.

First, we study the decision problem of whether an approximate query is feasible over inconsistent data. A query is feasible when the accuracy degree of its results satisfies the tolerance threshold specified by the user. We show that if queries can be assessed in this way before execution at a relatively small cost, the efficiency of the whole query processing improves greatly; the gain is larger for large data sets and for applications with high query overhead. We design a sampling method to estimate the accuracy degree of the query results: it draws samples separately from the consistent part and the inconsistent part of the data, using a different sampling strategy for each part, and estimates the accuracy degree from these samples. We prove that the estimator is asymptotically unbiased and that it is an (epsilon, delta)-estimate.

Second, we study the problem of evaluating query results over inconsistent data, since such results may be incorrect. We design a novel query model for inconsistent data: it computes the query results as a traditional database does, but additionally returns the consistency degree of the results, defined as the proportion of consistent results in the whole result set. Here the consistent results are the maximal set of certain answers, i.e., those that can be obtained over every possible repair. With the new query model, users obtain as much useful information as possible, together with the overall characteristics of the query results. Specifically, for a query that can be rewritten in first-order logic, the consistency degree is computed by executing the rewritten query directly; otherwise, we design a sampling method to compute an approximate estimate and prove that this estimate of the consistency degree is an (epsilon, delta)-estimate.
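As a minimal illustration of the stratified-sampling idea used for both the accuracy-degree and the consistency-degree estimates above, the following Python sketch draws samples separately from the consistent and inconsistent parts of the data and combines the per-part estimates, weighted by part size. All names here (query, is_consistent_answer, the sample sizes) are hypothetical; the real estimators and their (epsilon, delta) guarantees are those developed in the dissertation, not this sketch.

    import random

    def estimate_consistency_degree(consistent_part, inconsistent_part,
                                    query, is_consistent_answer,
                                    n_cons=100, n_incons=100):
        """Stratified-sampling sketch: estimate the proportion of consistent
        answers in a query result over data split into a consistent part and
        an inconsistent part (hypothetical interfaces, illustration only)."""
        def fraction_consistent(sample):
            # Evaluate the query on the sampled tuples and check which of the
            # produced answers would survive every possible repair.
            answers = [a for t in sample for a in query(t)]
            return 1.0 if not answers else (
                sum(is_consistent_answer(a) for a in answers) / len(answers))

        # Sample each part separately, possibly with different strategies.
        s_cons = random.sample(consistent_part, min(n_cons, len(consistent_part)))
        s_incons = random.sample(inconsistent_part, min(n_incons, len(inconsistent_part)))

        # Weight each part by its share of the whole data set.
        n_total = len(consistent_part) + len(inconsistent_part)
        return (len(consistent_part) / n_total) * fraction_consistent(s_cons) \
            + (len(inconsistent_part) / n_total) * fraction_consistent(s_incons)

Larger sample sizes tighten such an estimate; the dissertation quantifies this trade-off through its (epsilon, delta) analysis.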
Third, we study aggregation queries over inconsistent data. Inconsistency presents two challenges for aggregation. First, different repairs of the inconsistent data may yield different aggregate values, i.e., the aggregation results over inconsistent data are uncertain. Second, the number of possible repairs may be exponential, which can make the set of possible aggregate results huge. To cope with this, we return the range of all possible aggregate results, that is, the greatest lower bound and the least upper bound of all possible values (a toy illustration of this bounding idea is sketched at the end of this abstract). We consider five aggregation operators: MAX, MIN, SUM, COUNT, and AVG. In most cases the bounds cannot be computed in polynomial time, so we design corresponding approximation algorithms.

Finally, we present a prototype system, called Entity Manager, for managing inconsistent data. In the real world, multiple descriptions or multiple values for the same entity are a common cause of inconsistent information. Current approaches mainly adopt entity resolution to identify the tuples that describe the same entity and then repair them to the most accurate value. In practice, however, the attribute values of an entity may not be unique, so the identify-and-repair method loses available information. In view of these problems, Entity Manager takes real-world entities as its basic storage unit, retrieves query results according to the quality requirements of users, and can handle all kinds of inconsistent data recognized by entity resolution. This dissertation elaborates the Entity Manager system, covering its architecture, data model, query processing techniques, and query optimization techniques.
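To make the bounding idea for aggregation concrete, the toy Python sketch below assumes the simplest repair model suggested by the Entity Manager data model: each entity stores several candidate attribute values, and any possible repair picks exactly one candidate per entity, independently. Under that assumption the SUM and COUNT ranges are easy to compute; the dissertation's general setting (arbitrary integrity constraints and the full set of MAX, MIN, SUM, COUNT, AVG) is harder, which is why approximation algorithms are needed there. The function names and data are invented for illustration.

    def sum_bounds(entities):
        """entities: list of candidate-value lists, one list per entity.
        Returns the (greatest lower bound, least upper bound) of SUM over
        all repairs that pick one candidate value per entity."""
        return (sum(min(c) for c in entities), sum(max(c) for c in entities))

    def count_bounds(entities, predicate):
        """Bounds of COUNT over entities whose chosen value satisfies `predicate`."""
        lower = sum(all(predicate(v) for v in c) for c in entities)  # true in every repair
        upper = sum(any(predicate(v) for v in c) for c in entities)  # true in some repair
        return (lower, upper)

    # Toy usage: two entities, the first with two conflicting salary values.
    salaries = [[3000, 3500], [4200]]
    print(sum_bounds(salaries))                           # (7200, 7700)
    print(count_bounds(salaries, lambda v: v > 4000))     # (1, 1)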
Keywords/Search Tags:consistent query answering, aggregation, result estimation, approximation