
Effective Rule-based Algorithms For Data Cleaning

Posted on: 2022-05-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: HIBA ABU AHMAD (AB)
Full Text: PDF
GTID: 1488306569986929
Subject: Computer Science and Technology
Abstract/Summary:
The real world is witnessing exponential growth of big data. Many emerging technologies can supply information of every kind, yet extracting the implicit, valuable knowledge from it remains a major challenge. The reason is poor real-world data quality, which seriously disrupts data mining and analysis and leads to unreliable results. Data quality is degraded by various noise sources, such as deficiencies in information extractors, inaccuracies in data generators, and heterogeneous data representations across data sources. The result is dirty data, which costs institutions and enterprises billions of dollars every year. Data cleaning is therefore an essential step in knowledge discovery for addressing data quality problems. Data cleaning is a two-step process of uncovering data errors and then correcting them so that the data comply with a set of rules, i.e., data quality rules. Rule-based data cleaning is a crucial technology in which data cleaning rules play a significant role in improving data quality. A data quality rule set is a declarative way to specify valid or correct data values; a violation is any data instance that does not match the specified rules. Data quality rules should uncover semantic errors, which are more complicated than syntactic errors, and preferably correct them. However, the more expressive a rule language is, the harder its rules are to discover and to employ automatically in data cleaning methods. This dissertation proposes three novel rule discovery algorithms for data cleaning that improve entity resolution, error detection, and data repairing, three substantial data quality problems. The specific innovations are as follows.

Entity resolution is an important data cleaning task that detects different data items belonging to the same real-world entity. It has a critical impact on digital libraries, where different entities share the same name without any identifying key. Conventional methods adopt similarity measures and clustering techniques to reveal the records of a specific entity. They assume that records belonging to the same entity are more similar to each other than to other records, which does not hold for all records. Because of this shortfall in performance, recent methods build rules over record attributes whose values are distinct across entities. However, they use inadequate attributes and ignore common and empty attribute values, which degrades the quality of entity resolution. To solve this problem, this thesis defines a multi-attribute weighted rule (MAWR) system that considers all values of record attributes in order to represent the difficult record-to-entity mapping, and then proposes an effective rule generation algorithm based on this system. The thesis also proposes an entity resolution algorithm (MAWR-ER) that uses the generated rules to identify the entities of a data set effectively and efficiently. Experimental results on real-life data demonstrate the effectiveness and efficiency of the proposed method, with better performance and more robustness than state-of-the-art entity resolution methods.

Data repairing is another key problem in data cleaning, aiming to expose and rectify data errors. Traditional methods rely on data dependencies to check whether errors exist in the data, but they cannot pinpoint which values are wrong, and worse, they cannot fix the wrong values. To overcome this limitation, recent methods define repairing rules on which they depend to detect and fix errors. However, all existing data repairing rules are provided by experts, which is expensive in time and effort. Moreover, rule-based data repairing methods need an externally verified data source or manual verification; otherwise they are incomplete and can repair only a small number of errors. This thesis therefore defines weighted matching rectifying rules (WMRRs) based on similarity matching to capture more errors. Depending neither on humans nor on reliable external data is practical for rule discovery, since human effort is costly and reliable external data sources are not always available. Hence, this thesis proposes a novel algorithm to discover WMRRs automatically from the dirty data at hand. It also develops an automatic algorithm for resolving rule inconsistencies, in contrast to existing data repairing rules, for which experts are required to resolve inconsistencies. Additionally, based on WMRRs, this thesis proposes an automatic data repairing algorithm (WMRRDR) that uncovers a large number of errors and rectifies them dependably. The proposed method performs reliable and accurate data repairing fully automatically, based on the data at hand, without master data or user verification, and achieves higher repairing recall without any loss of repairing precision. Experimental results on both real-life and synthetic data prove that the method can discover effective WMRRs from dirty data and perform dependable, fully automatic repairing based on the discovered WMRRs, with higher accuracy than existing dependable methods.

Data sampling is a major data reduction technique that selects a representative sample of feasible size from the entire data set for processing, which is very beneficial for speeding up big data analysis. In the data repairing context, sampling has been proposed as an approximation technique for fast rule discovery from large data sets, trading accuracy against efficiency. Weighted matching rectifying rules can achieve highly accurate repairing, but they require scanning the whole data set to discover an extensive rule set, which is too time-consuming for interactive applications. For large-scale data, this thesis introduces a sampling-based rule discovery approach for approximate weighted matching rectifying rules. It proposes a sampling algorithm that efficiently extracts a suitable random sample with high-availability items fit for discovering approximate weighted matching rectifying rules. It also proposes an approximate rule-based data repairing framework in which approximate rules are efficiently discovered from the generated sample, and the consistent approximate rules then detect and repair errors over the entire data set. In this way, partial data repairing is accomplished reliably and efficiently by accurately fixing a tolerable portion of the data errors. Although the proposed method sacrifices repairing completeness to a certain extent, it maintains the correctness of repairing and considerably improves its efficiency. The method reduces the ratio of errors through partly reliable data repairing to cope with the growing size of modern data sets. Comprehensive experimental results verify the efficiency of the proposed method and demonstrate the good performance of approximate rules on data repairing.
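To build intuition for the multi-attribute weighted rule idea described above, the following sketch illustrates weighted matching over all record attributes, where empty values contribute nothing rather than counting as mismatches. The attribute names, weights, threshold, and greedy clustering policy are illustrative assumptions, not the dissertation's exact MAWR formulation.

```python
# Illustrative sketch (hypothetical weights/threshold, not the actual MAWR
# system): each attribute carries a weight reflecting how distinctive its
# values are for separating entities; two records are mapped to the same
# entity when their weighted agreement score passes a threshold.

WEIGHTS = {"affiliation": 0.5, "coauthor": 0.3, "venue": 0.2}

def agreement(r1, r2, weights=WEIGHTS):
    """Weighted fraction of non-empty attributes on which r1 and r2 agree."""
    score, total = 0.0, 0.0
    for attr, w in weights.items():
        v1, v2 = r1.get(attr), r2.get(attr)
        if not v1 or not v2:          # empty values are uninformative, skip
            continue
        total += w
        if v1 == v2:
            score += w
    return score / total if total else 0.0

def resolve(records, threshold=0.6):
    """Greedy clustering: add each record to the first cluster whose
    representative it matches above the threshold, else start a cluster."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if agreement(rec, cluster[0]) >= threshold:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters
```

Note how two records sharing affiliation and coauthor cluster together even when one has an empty venue, which is the kind of case the abstract says value-distinctness rules mishandle.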
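The repairing idea behind weighted matching rectifying rules can be sketched as follows. The rule structure (an evidence pattern, a target attribute, a correct value, a weight) and the similarity policy here are simplified assumptions for illustration, not the dissertation's exact WMRR definition: a record that matches a rule's evidence but whose target value is merely *similar* to the correct value is treated as an error and rectified.

```python
from difflib import SequenceMatcher

# Hedged sketch of similarity-based rule repairing (rule fields and the
# 0.6 similarity threshold are illustrative assumptions, not actual WMRRs).

def similar(a, b):
    """Edit-based similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def repair(record, rules, sim_threshold=0.6):
    """Apply applicable rules in descending weight order; a value that is
    similar-but-not-equal to the rule's correct value is rectified."""
    fixed = dict(record)
    for rule in sorted(rules, key=lambda r: -r["weight"]):
        evidence, attr, correct = rule["evidence"], rule["attr"], rule["correct"]
        if all(fixed.get(k) == v for k, v in evidence.items()):
            value = fixed.get(attr, "")
            if value != correct and similar(value, correct) >= sim_threshold:
                fixed[attr] = correct   # similar but wrong: detected and fixed
    return fixed
```

The similarity matching is what lets one rule capture many misspelled variants of the same wrong value, which is how the abstract motivates WMRRs capturing more errors than exact-match repairing rules.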
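Finally, the sampling-based discovery idea can be illustrated with a minimal sketch. The uniform random sampling, the simple evidence-to-value rule form, and the support threshold below are all illustrative assumptions rather than the dissertation's actual sampling algorithm: rules are mined from a random sample instead of the full data set, sacrificing some completeness for far less scanning.

```python
import random
from collections import Counter, defaultdict

# Hedged sketch of sampling-based approximate rule discovery (sampling
# scheme, rule form, and support threshold are hypothetical).

def discover_rules(records, evidence_attr, target_attr,
                   sample_ratio=0.5, min_support=2, seed=0):
    """Mine approximate (evidence value -> target value) rules from a
    random sample of the records instead of scanning them all."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * sample_ratio))
    sample = rng.sample(records, k)
    by_evidence = defaultdict(Counter)
    for r in sample:
        ev, tv = r.get(evidence_attr), r.get(target_attr)
        if ev and tv:                      # skip empty values
            by_evidence[ev][tv] += 1
    rules = {}
    for ev, counter in by_evidence.items():
        tv, support = counter.most_common(1)[0]
        if support >= min_support:         # keep only well-supported rules
            rules[ev] = tv
    return rules
```

Rules discovered this way can then drive repairing over the entire data set, which is the trade-off the abstract describes: approximate rules found cheaply on a sample, applied reliably to all the data.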
Keywords/Search Tags:Data Quality, Data Cleaning, Rule Discovery, Entity Resolution, Data Repairing