Research On Micro Inconsistencies Of Data

Posted on:2021-01-15

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J Z Sun

Full Text:PDF

GTID:1368330614950650

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Recently,with the rapid growth of data size,data quality issues have become an important research direction in the database research area.Inconsistency is one of the most important aspects of data quality problems.Data quality rules are important tools to deal with data inconsistencies.In order to detect and repair inconsistent data as much as possible,various kinds of constraining rules were proposed,including functional dependencies,conditional functional dependencies,editing rules,repair rules,etc.Most of the proposed rules describe the macro inconsistencies of data,claiming that a tuple's values on several attributes can determine other attribute values of the tuple.To the best of our knowledge,all existing rules consider each attribute value as an inseparable atom,even though a micro-part of a value can determine another attribute's value in many real-world applications.To describe this type of micro information,we propose a new kind of data quality rules named micro functional dependencies(micro FDs),which can help in dealing with micro inconsistencies in data.Similar to existing studies,we focus on several issues including grammatical and semantical analysis,the rule discovery problem,inconsistency detecting and repairing related to micro inconsistencies in data.First,we formally define micro FDs grammatically and semantically by involving extracting functions to capture micro inconsistencies in data.Several basic properties of micro FDs are studied,including satisfiability analysis,implication analysis and inference system.Satisfiability analysis focuses on determining whether a set of micro FDs itself is consistent.Implication analysis focuses on determining whether a set of micro FDs itself is redundant.The inference system consists of several inference rules corresponding to the implication analysis.We prove that the determining problem of satisfiability is NPComplete,the determining problem of implication is Co NP-Complete,and finally a sound and complete inference system is given.Secondly,we study the problem of discovering Micro FDs.Because almost all micro-information is extracted from a string-type attribute,we consider discovering dependencies with regular expressions and locations on the left hand.We cluster and align strings according to the similarity between two characters,then conclude them into the form of regulation expressions.Both clustering and aligning are challenging and play crucial roles in the discovering task.A greedy merging method in a bottom-up manneris employed to do the clustering and aligning works simultaneously.Due to the greedy algorithm's inefficiency,several pruning strategies are proposed to reduce the running time,and some significant theoretical results are attained.In the experimental study,our methods' effectiveness and efficiency are verified on three real data sets as well as several synthetical data sets.Then we study the problem of detecting single-source tuples that violate multiple dependencies.This thesis investigates the inconsistent detection problem of on-disk data,based on data clustering.The data clustering algorithm is usually implemented by sorting or hashing.When the data size is very large,both sorting and hashing require multiple read and write operations of the data.To reduce the number of IO operations,the characteristics of micro FDs can be exploited to share intermediate results when detecting violations under multiple constraining rules.According to the different applicable conditions,three sharing technologies are given for the detection tasks in the case of two dependencies.We prove that when there are many dependencies,the scheduling problem of multiple detecting tasks is NP-complete,and a greedy heuristic scheduling algorithm with approximation guarantee is given.Experiments show that the sharing technology proposed in this thesis can save much IO cost and improve the efficiency of detecting inconsistent data greatly.Finally,we study the problem of repairing micro inconsistencies in data.We focus on macro inconsistencies between data sources and micro inconsistencies in a single source simultaneously.The former corresponds to truth discovery methods,and the latter corresponds to rule-based data repairing.The two problems above may coexist,considering them together can provide some benefits,and to the author's best knowledge,this has not yet been considered in any previous studies.In this thesis,a schema-decomposing based method is proposed to simultaneously discover the truth and clean the data,with the aim of improving data consistency.Experimental results on real-world data sets of laptops and mobile phones,as well as simulated data sets demonstrate the effectiveness and efficiency of our proposed method.

Keywords/Search Tags:

Micro Functional Dependency, Discovery of Micro Functional Dependencies, Inconsistent Data Detection, Truth Discovery, Rule-based Repairing

PDF Full Text Request

Related items

1	Functional Dependencies Discovery Based On Sampling
2	Algorithms Of Copy Detection And Truth Discovery For Multi-relational Data
3	Functional Dependencies For XML
4	Research And Implementation Of Distributed And Parallel Algorithm For Large-Scale Functional Dependency Discovery
5	Spark-based Distributed Functional Dependency Discovery Algorithm
6	Research On Truth Discovery Based On Bayes Model In Web Data Integeration
7	Research On General Parallelization Strategy For Functional Dependency Discovery Algorithms
8	Study On Mining Algorithms Of Batch Processing Patterns Based On Functional Dependency
9	Research On Algorithm Of Truth Discovery Based On Estimation Of Statistical Feature Parameters
10	Key Techniques Of Structured Data Cleaning