Research And Implementation Of Automatic Discovery Of Data Quality Detection Rules

Posted on: 2021-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Huang
GTID: 2428330623468147
Subject: Software engineering
Abstract/Summary:
Data, like a product, has quality, and that quality determines how fully and how reliably the target data can be mined. In everyday production and use, generating and processing data often introduces dirty data, which makes analysis results less reliable. Detecting data quality problems requires configuring data quality detection rules. At present, data engineers mostly configure these rules by hand, which increases their workload and lowers their efficiency. Research on automatically discovering data quality detection rules from the data itself is therefore increasingly active. Because a conditional functional dependency (CFD) expresses both an association between attributes and a specific semantic constraint on their values, current research on the automatic discovery of data quality detection rules mainly means research on the automatic discovery of CFDs. In practice, however, missing data reduces the number of CFDs that can be found, and existing work has paid little attention to pruning optimization in CFD discovery algorithms. This thesis therefore takes the automatic discovery of data quality detection rules and its implementation as its topic, and studies a missing-value filling method and a CFD discovery method. The main research contents and results are as follows:

(1) To address the problem that missing values in a data set reduce the number of CFDs discovered from it, a missing-value filling method for the data preprocessing stage is proposed, based on improved affinity propagation (AP) clustering and an improved k-nearest neighbor (KNN) algorithm. The method can fill missing values in different types of data sets and effectively improves filling accuracy.

(2) To address the long running time of the traditional CTANE algorithm when discovering CFDs on data sets with many attributes and tuples, a pruning optimization of CTANE is proposed. Compared with traditional CTANE, the method effectively reduces the time needed to discover CFDs automatically, without losing any minimal CFDs on the data set.

(3) In real application scenarios, manual rule configuration increases the workload of data engineers and lowers their efficiency. Based on the proposed methods, this thesis therefore designs and implements a rule management module that runs on a data quality detection platform. The module provides automatic discovery and unified management of data quality detection rules, reduces the workload of configuring rules by hand, improves work efficiency, and shortens the interval between rule configuration and quality detection.

The proposed methods were verified in a series of comparative experiments, which show that they effectively improve the accuracy of missing-value filling and the efficiency of CFD discovery, and enable the automatic discovery of data quality detection rules.
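
To make concrete how a CFD serves as a data quality detection rule, the following is a minimal Python sketch of checking one CFD against a small table. It is not the thesis's implementation and does not perform discovery (CTANE is not shown). The function name `cfd_violations`, the `"_"` wildcard convention, and the sample rows are assumptions made up for illustration. A CFD here is a dependency `lhs -> rhs` together with a pattern that holds either a constant or a wildcard for each attribute.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed pattern symbol meaning "any value"


def matches(value, pattern):
    """True if a cell value matches a pattern entry (constant or wildcard)."""
    return pattern == WILDCARD or value == pattern


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows that violate the CFD (lhs -> rhs, pattern).

    rows    : list of dicts, one per tuple
    lhs     : list of left-hand-side attribute names
    rhs     : single right-hand-side attribute name
    pattern : dict mapping each attribute in lhs + [rhs] to a constant or WILDCARD
    """
    violating = set()
    groups = defaultdict(list)  # LHS values -> indices of rows matching the LHS pattern

    for i, row in enumerate(rows):
        if not all(matches(row[a], pattern[a]) for a in lhs):
            continue  # the CFD only constrains rows that match the LHS pattern
        # Single-tuple violation: the RHS pattern is a constant and the row disagrees.
        if not matches(row[rhs], pattern[rhs]):
            violating.add(i)
        groups[tuple(row[a] for a in lhs)].append(i)

    # Pair violation: rows agree on the LHS but carry different RHS values.
    for indices in groups.values():
        if len({rows[i][rhs] for i in indices}) > 1:
            violating.update(indices)

    return sorted(violating)


# Hypothetical sample data, invented for this sketch.
rows = [
    {"country": "UK", "zip": "EH4", "street": "Mayfield"},
    {"country": "UK", "zip": "EH4", "street": "Crichton"},  # conflicts with row 0
    {"country": "US", "zip": "EH4", "street": "Main"},      # outside the pattern
    {"country": "UK", "zip": "W1B", "street": "Regent"},
]

# Rule: "for UK addresses, zip determines street".
rule = {"country": "UK", "zip": WILDCARD, "street": WILDCARD}
print(cfd_violations(rows, ["country", "zip"], "street", rule))  # prints [0, 1]
```

The rule flags rows 0 and 1 because they match the pattern, agree on `country` and `zip`, and disagree on `street`. The US row is ignored because the condition `country = UK` does not apply to it, which is what distinguishes a CFD from a plain functional dependency.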
Keywords/Search Tags: data quality, conditional functional dependency (CFD), affinity propagation (AP) clustering, k-nearest neighbor (KNN)