Font Size: a A A

Technology For Answering Queries On Incomplete Data

Posted on:2020-01-02Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y N LiuFull Text:PDF
GTID:1368330590472857Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Along with the arrival of big data era,the importance of data quality is more and more significant.Due to errors among digitization of information,information contained in databases cannot completely reflect the real physical world.It is widely reported that the obsolete data has brought biases in computation results,which has seriously widely affected commercial decisions and citizen's lives,and brought challenges to query processing.Therefore,it highlights the needs of designing technology for answering queries on incomplete data,which plays a very important role in taking advantage of low usability data effectively.Present research of data usability on technology for answering queries on incomplete data is still immethodical,and is faced with great challenges.First,there does not exist a uniform data completeness evaluation model,which can be used to evaluate the real extent to which a data set contains information.Second,faced with data sets with missing values hard to repair,there are no methods to offer answers with as much complete information as possible,according to a user's interests.Third,on incomplete data,taking both the extent of complete information and an aggregation object into consideration,there are no methods to provide high-quality answers with some quality errors.Fourth,since there are no methods to quickly estimate the extent to which a query answer contains information,a method should be designed to draw a subset of data for estimation.Such a subset reflects completeness characteristics of the whole data,which can be used for efficient estimating methods.To address the challenges outlined above,without repairing missing values,in this thesis,taking advantage of features of relational data,technology for query answering on incomplete data is offered to provide answers with more complete information,and a series of theories and corresponding efficient algorithms are proposed to solve the key problems of query answering on incomplete data.The research issues and contributions of this thesis are summarized as follows.(1)In this thesis,the problem of modeling and determining data completeness is studied.To overcome the limitations of dependency on specific queries,and underestimate on the real extent of complete information by present methods,this thesis formally proposes a novel data completeness model based on functional dependencies with different granularities.This model consists of completeness information of attributes,tuples and relations.Based on this model,the problem of determining data completeness is firstly studied and formally defined,and a lower bound of time complexity of this problem is provided.To take advantage of functional dependencies,completeness pseudo closure for determining data completeness is defined.Then properties of completeness pseudo closure are analyzed,and completeness propagation graphs are built to determine data completeness.Based on completeness propagation graphs,an efficient algorithm is designed,whose time complexity reaches the lower bound of this problem.Finally,the efficiency and effectiveness of the proposed algorithm are verified by experimental results on real-life and synthetic data sets.(2)In this thesis,the problem of answering queries based on a dominant set on incomplete data is investigated,and efficient algorithms are designed.When some missing values cannot be repaired or need a long time to repair,according to a user's interests,it is a feasible way to select a size constrained subset with high completeness tuples on an incomplete data set providing complete values on interesting attributes.Such a subset is called a dominant set.Based on a dominant set,queries can be further answered effectively,and answers can be offered efficiently.The problem of selecting a dominant set is firstly formalized in this thesis,and then its decision version is proved to be NP-Complete.After that,an efficient approximate algorithm is designed to draw a dominant set.By theoretical analysis,the drawn dominant set are with well properties.Strategies based on a dominant set for further answering queries are given.Experiments are conducted on real and synthetic data sets to analyze the efficiency and effectiveness of algorithms,and impacts of parameters are analyzed.(3)In this thesis,the methods for answering queries with completeness constraints on incomplete data are studied,and efficient algorithms are given.On incomplete data sets,query answers are always without enough information.Therefore,a query answer form for incomplete data is proposed.Within some error bound of quality,a query answer not only provides complete information on some attributes of a user's interest,but also approximately satisfies an aggregate property to form a high-quality group.The problem of answering queries with completeness constraints is studied.First,the problem is formalized and its decision version is proved to be NP-Complete.Second,depending on whether an explicit strategy for selecting tuples is given by a user or not,based on greed and weighted sampling respectively,two approximate polynomial algorithms are proposed.For the two algorithms,time complexity and quality of query answers by the two algorithms is theoretically analyzed and proved.Experimental results on real and synthetic data sets are given to show high-quality answers by the two algorithms,efficiency and well scalability of the two algorithms.(4)In this thesis,the problem of estimating query completeness on incomplete data is investigated,and an efficient algorithm for estimation is designed.Since there are no metrics to measure completeness information of an entire data set,taking advantage of a subset reflecting completeness characteristics of the whole data set,it is feasible to estimate completeness of a query answer from an incomplete data set.To this end,this thesis proposes two properties for a characteristic set: coverage and completeness,which measure the extent to which complete information is contained on attributes and attribute values respectively.To satisfy these two properties and different requirements,6 kinds of characteristic set are defined.The decision version of the problems of drawing any kind of characteristic set are proved to be NP-Complete.Then,strategies for guessing the optimal size and assigning error bounds are designed to comprehensively use tuples with different numbers of missing values for the two properties.Based on uniform sampling and the strategies,an efficient drawing algorithm is proposed,which can draw a set of tuples approximately satisfying the two properties in polynomial time.An efficient estimation algorithm is proposed based on the drawn characteristic set.By the theoretical analysis,well properties of proposed algorithm are proved.Experimental results on real and synthetic data sets are given to show query completeness can be estimated efficiently.
Keywords/Search Tags:Data Quality, Data Usability, Data Completeness, Query Processing
PDF Full Text Request
Related items