
Key Techniques Of Structured Data Cleaning

Posted on: 2019-05-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Hao
Full Text: PDF
GTID: 1368330590951478
Subject: Computer Science and Technology
Abstract/Summary:
Real-world data is dirty, e.g., inconsistent, inaccurate, and mis-categorized, so data cleaning plays an important role in data analysis and management. Traditional data cleaning methods sometimes fail to detect errors completely or to repair data correctly; moreover, most of them cannot exploit external resources, such as knowledge bases and users, for data cleaning. To improve the accuracy of data cleaning and meet the challenges of the big-data era, this thesis studies key techniques of structured data cleaning from three aspects, namely constraint-based data repairing, rule-based data repairing, and user-guided data repairing, and designs efficient algorithms and indexing techniques for them. The main contributions of this thesis are as follows:

1. Eliminating violations based on constraints. Most functional-dependency-based cleaning algorithms suffer from incomplete error detection. This thesis therefore proposes a revised semantics of violation and data consistency w.r.t. a set of functional dependencies. The revised semantics relies on string similarities, in contrast to traditional methods that detect errors syntactically through string equality (see the first sketch after this abstract). Along with the revised semantics, the thesis proposes a new cost model that quantifies the cost of a repair, and proves that finding minimum-cost repairs under this model is NP-hard. Expansion-based and greedy algorithms are therefore designed to find optimal and approximate repairs, respectively, and indices and optimization techniques are developed to improve efficiency. Experiments show that this approach significantly improves error detection and repair, which in turn improves both precision and recall.

2. Cleaning relations using knowledge bases. This thesis studies the problem of detecting and repairing erroneous data, as well as marking correct data, using well-curated knowledge bases. It proposes a new type of data cleaning rule that makes actionable decisions on relational data by building connections between a relation and a knowledge base (see the second sketch after this abstract). The thesis gives a formal definition of the rule, studies the fundamental problems associated with it, e.g., rule consistency and rule implication, presents efficient algorithms for applying the rules to clean a relation, and discusses how to generate rules from examples. Extensive experiments on both real-world and synthetic datasets verify the effectiveness and efficiency of applying these rules in practice.

3. Human-in-the-loop data repairing. This thesis studies the problem of discovering mis-categorized entities within a given group of entities using a hybrid human-machine method, and proposes a novel rule-based framework to solve it (see the third sketch after this abstract). The framework first uses positive rules to compute disjoint partitions of the entities, taking the largest partition as the correctly categorized one, namely the pivot partition. It then uses negative rules to identify mis-categorized entities in the other partitions that are dissimilar to the entities in the pivot partition, and finally leaves it to the user to decide which entities to remove from their groups. The thesis describes a signature-based "filtering-and-verification" framework for applying these rules and discusses how to generate positive and negative rules from the user's decisions. Extensive experimental results on real-world datasets show the effectiveness of the proposed solution.
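First sketch. The Python snippet below shows one way the similarity-based violation semantics of contribution 1 can be read: tuples whose left-hand-side values are merely similar, not identical, are still required to agree on the right-hand side. It is an interpretation of the abstract rather than the thesis' algorithm; the similarity measure (difflib's ratio()), the threshold, and the toy relation are all assumptions.

from difflib import SequenceMatcher
from itertools import combinations

def similar(a, b, threshold=0.9):
    # Stand-in similarity measure; the thesis' actual measure is not given here.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def fd_violations(rows, lhs, rhs, threshold=0.9):
    # Yield index pairs violating the FD lhs -> rhs under the revised,
    # similarity-based semantics: similar left-hand sides must also have
    # similar right-hand sides.
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        lhs_close = all(similar(t1[a], t2[a], threshold) for a in lhs)
        rhs_close = all(similar(t1[a], t2[a], threshold) for a in rhs)
        if lhs_close and not rhs_close:
            yield i, j

rows = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "Jon Smith",  "city": "Seattle"},   # similar name, different city
    {"name": "Alice Chen", "city": "Seattle"},
]
# Equality-based checking of name -> city finds nothing here; the revised
# semantics flags the first two tuples as a violation.
print(list(fd_violations(rows, lhs=["name"], rhs=["city"])))   # [(0, 1)]

The flagged pairs would then feed the repair step, where the thesis' cost model and expansion-based or greedy algorithms (not sketched here) choose which values to change.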
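Second sketch. This is a deliberately simplified picture of contribution 2: a rule connecting a relation to a curated knowledge base can positively mark values the KB confirms and suggest repairs for values that contradict it. The toy knowledge base, the attribute names, and the equality-based matching are illustrative assumptions; the thesis' formal rule language and its consistency and implication analysis are not reproduced here.

# Toy knowledge base: country -> capital.
KB = {"France": "Paris", "Japan": "Tokyo", "Canada": "Ottawa"}

def apply_kb_rule(rows, evidence_attr, target_attr, kb):
    # Return (marked_correct, repairs): row indices confirmed by the KB and
    # suggested value repairs keyed by row index.
    marked_correct, repairs = [], {}
    for i, t in enumerate(rows):
        kb_value = kb.get(t[evidence_attr])
        if kb_value is None:
            continue                      # no KB evidence: leave the tuple untouched
        if t[target_attr] == kb_value:
            marked_correct.append(i)      # positively mark correct data
        else:
            repairs[i] = kb_value         # erroneous value with a KB-backed fix
    return marked_correct, repairs

rows = [
    {"country": "France", "capital": "Paris"},
    {"country": "Japan",  "capital": "Kyoto"},    # erroneous value
    {"country": "Brazil", "capital": "Brasilia"}, # not covered by the toy KB
]
print(apply_kb_rule(rows, "country", "capital", KB))  # ([0], {1: 'Tokyo'})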
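Third sketch. This mimics the human-in-the-loop pipeline of contribution 3 at toy scale: a positive rule partitions the entities, the largest partition is taken as the pivot, and a negative rule flags entities in the remaining partitions that share no evidence with the pivot, leaving the final decision to the user. The token-based signatures and the overlap threshold are hypothetical stand-ins for the thesis' signature-based filtering-and-verification machinery.

from collections import defaultdict

def tokens(entity):
    return set(entity.lower().split())

def partition_by_positive_rule(entities):
    # Positive rule (illustrative): entities sharing their head token belong together.
    parts = defaultdict(list)
    for e in entities:
        parts[e.lower().split()[0]].append(e)
    return list(parts.values())

def flag_by_negative_rule(entities, pivot, min_overlap=1):
    # Negative rule (illustrative): flag entities sharing fewer than
    # min_overlap tokens with every entity in the pivot partition.
    return [e for e in entities
            if all(len(tokens(e) & tokens(p)) < min_overlap for p in pivot)]

group = ["iphone 6 case", "iphone 7 case", "iphone charger",
         "samsung galaxy s8", "iphone 8 plus"]
parts = partition_by_positive_rule(group)
pivot = max(parts, key=len)               # largest partition becomes the pivot
candidates = [e for p in parts if p is not pivot
              for e in flag_by_negative_rule(p, pivot)]
print(candidates)   # ['samsung galaxy s8'] -> shown to the user for a final decision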
Keywords/Search Tags:data violation, data repairing, functional dependency, data cleaning rule, human-machine method