
Research On Computational Complexity Theory And Algorithms For Data Consistency

Posted on: 2017-02-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D J Miao
Full Text: PDF
GTID: 1108330503969766
Subject: Computer software and theory
Abstract/Summary:
A large amount of low-quality data has been produced along with the explosive growth of information in many application fields. Such data leads to serious consequences and greatly limits the usability of the data. Low data quality typically shows up as inconsistency, inaccuracy, incompleteness, and non-currency, of which inconsistency is the most typical. It is therefore necessary to improve inconsistent data so as to guarantee data utility. However, the existing theory and methods that describe consistency with data dependencies are not yet well developed, and such methods have several intrinsic shortcomings: first, integrity constraints are good at detecting inconsistency but not at evaluating its degree; second, integrity constraints cannot guide data repairing, and automatic repairing methods cannot guarantee that inconsistent data are repaired completely correctly; third, it is hard to work out a correct and complete rule set for inconsistency detection and data repairing. This dissertation studies several underlying problems, on both the computational complexity and the algorithmic side, in rule discovery from probabilistic data, inconsistency evaluation, data repair, and querying inconsistent data, and solves these problems well.

(1) This dissertation studies algorithms for discovering data consistency rules. To overcome the shortcomings of mining functional dependencies (FDs) and conditional functional dependencies (CFDs), and to avoid generating low-quality, redundant rules, it studies relaxed definitions of FDs and CFDs and ways of mining such relaxed rules from data sets of general type. It proposes a method for mining approximate conditional functional dependencies from a probabilistic database, since a probabilistic database generalizes ordinary data, and an approximate conditional functional dependency relaxes the standard conditional functional dependency.
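As an aside, the idea of an approximate dependency can be illustrated with a small sketch. The snippet below computes the standard confidence of an approximate FD X → Y over an ordinary table: the largest fraction of rows that can be kept so that the dependency holds exactly. This is only an illustration of the general notion, not the dissertation's probabilistic variant; the function and attribute names are hypothetical.

```python
from collections import defaultdict

def fd_confidence(rows, lhs, rhs):
    """Confidence of the approximate FD lhs -> rhs: the largest fraction
    of rows that can be kept so that the dependency holds exactly.
    rows: list of dicts; lhs, rhs: lists of attribute names."""
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        groups[key][val] += 1
    # Within each lhs-group, keep the most common rhs value;
    # all other rows in the group violate the dependency.
    kept = sum(max(vals.values()) for vals in groups.values())
    return kept / len(rows)

rows = [
    {"city": "NY", "zip": "10001"},
    {"city": "NY", "zip": "10001"},
    {"city": "NY", "zip": "10002"},  # violates city -> zip
    {"city": "LA", "zip": "90001"},
]
print(fd_confidence(rows, ["city"], ["zip"]))  # 0.75
```

A rule is reported as an approximate FD when this confidence exceeds a user-chosen threshold; the probabilistic setting studied in the dissertation additionally weights tuples by their probabilities.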
The proposed mining method adapts to more general data and provides more rules under the quality control of expert users. On this basis, the probabilistic approximate functional dependency is defined and several pruning criteria are proposed. Experiments on both synthetic and real data show the scalability of the framework and the effectiveness of the pruning criteria, and report some interesting mining results.

(2) This dissertation studies the complexity of, and algorithms for, database inconsistency evaluation. We define the minimum tuple deletion problem and use it to evaluate database inconsistency. On this basis, we study the relationship between the size of the rule set and the complexity of the problem. We show that the minimum tuple deletion problem remains NP-complete given two CFDs and three attributes involved in the CFDs, and that it is NP-hard to approximate within 17/16 given three CFDs and four attributes involved in the CFDs. We design a near-optimal approximation algorithm for computing the minimum tuple deletion with ratio 2 − 1/2^r, where r is the number of CFDs in the given rule set Σ. Under the unique games conjecture this ratio is near optimal: it is hard to improve it by a constant independent of n. Experiments show that our algorithm returns very accurate solutions.

(3) This dissertation investigates data repairing driven by query feedback, and formally studies two basic decision problems, the functional dependency restricted deletion propagation problem and the functional dependency restricted insertion propagation problem, corresponding to deletion and insertion feedback respectively. Since queries may be defined with different operations, we give a complete analysis of both the combined complexity and the data complexity of the cases induced by the different operations and feedback types. We identify the intractable and the tractable cases, and give efficient algorithms for the tractable ones.
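To make the minimum tuple deletion problem of item (2) concrete: for a single FD, tuples that agree on the left-hand side but differ on the right form a conflict graph, and deleting a minimum set of tuples is a vertex-cover-style problem. The sketch below is a simple matching-based 2-approximation on that conflict graph; it is only illustrative, and the dissertation's (2 − 1/2^r)-ratio algorithm is more refined. All names here are hypothetical.

```python
from itertools import combinations

def violating_pairs(rows, lhs, rhs):
    """Index pairs of rows that agree on lhs but differ on rhs."""
    pairs = []
    for i, j in combinations(range(len(rows)), 2):
        if all(rows[i][a] == rows[j][a] for a in lhs) and \
           any(rows[i][a] != rows[j][a] for a in rhs):
            pairs.append((i, j))
    return pairs

def greedy_tuple_deletion(rows, fds):
    """Delete tuples until no FD in fds is violated.
    Takes both endpoints of each uncovered conflict, as in a maximal
    matching: a classic 2-approximation of minimum vertex cover."""
    deleted = set()
    for lhs, rhs in fds:
        for i, j in violating_pairs(rows, lhs, rhs):
            if i not in deleted and j not in deleted:
                deleted.update((i, j))
    return deleted

rows = [
    {"city": "NY", "zip": "10001"},
    {"city": "NY", "zip": "10002"},
    {"city": "LA", "zip": "90001"},
]
print(greedy_tuple_deletion(rows, [(("city",), ("zip",))]))  # {0, 1}
```

The size of the deleted set (relative to the database size) then serves as a simple inconsistency measure, which is exactly the role the minimum tuple deletion problem plays in the dissertation.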
Concretely, for the functional dependency restricted deletion propagation problem, we give a trichotomy of the combined complexity for conjunctive queries, and prove that if the query is defined with selection and join operations, the problem is tractable in quadratic time. For the functional dependency restricted insertion propagation problem, this dissertation gives the combined and data complexity for monotone queries, and proves that if the query is a self-join-free selection-join query, the problem is tractable with respect to data complexity. Furthermore, based on the results above, for the case with both feedback types, we identify a class of cases tractable in data complexity, namely self-join-free selection-join queries, thereby extending the scope of application. Experiments show the accuracy and recall of the repairing algorithm.

(4) This dissertation considers how to obtain query results with quality guarantees for data that remain hard to repair. Uncertain data can be used to model inconsistency and to interpret query results. We study a complex case, locally correlated inconsistent spatial data, and investigate how to answer probabilistic frequent nearest neighbor queries with quality guarantees. The key difficulty is that processing such queries with existing algorithms for traditional and uncertain data incurs too high a computational cost. To solve this problem, we propose a general framework that includes a dynamic program for computing the probability mass function and a threshold-based pruning criterion. Experiments on both synthetic and real data show the efficiency and effectiveness of our algorithm.
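The dynamic program mentioned in item (4) can be sketched for the simplest, independent case: given the probability that each uncertain object lies within some distance of a query point, compute the probability mass function of how many objects do (a Poisson binomial distribution). This is a minimal sketch under an independence assumption; the dissertation handles the harder, locally correlated case.

```python
def count_pmf(probs):
    """pmf[k] = probability that exactly k of the independent events
    occur. O(n^2) dynamic program over the Poisson binomial
    distribution: extend the pmf one event at a time."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)   # event does not occur
            nxt[k + 1] += mass * p     # event occurs
        pmf = nxt
    return pmf

# Probability that at least 2 of 3 uncertain objects lie closer to the
# query point than a given candidate:
pmf = count_pmf([0.5, 0.5, 0.5])
print(sum(pmf[2:]))  # 0.5
```

A threshold-based pruning criterion, as in the framework above, would stop extending the pmf as soon as the accumulated tail probability can no longer reach the required threshold.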
Keywords/Search Tags: inconsistent data, consistency evaluation, query feedback, data repair, approximate functional dependency