Research On A Common Method For The Unsupervised Data Cleaning

Posted on:2020-05-26

Degree:Master

Type:Thesis

Country:China

Candidate:P Li

Full Text:PDF

GTID:2518306548992909

Subject:Management Science and Engineering

Abstract/Summary:

To monitor the operating status of equipment, a variety of sensors are widely used in many fields, and this has become one of the hallmarks of the Internet of Things society. When sensors collect or store data, equipment failure, electromagnetic interference, environmental change and other causes inevitably introduce many kinds of data quality problems, such as null values, garbled codes and other wrong data that violates attribute value constraints. Because it is difficult to obtain the real state of the equipment when a fault occurs, and there is no significant correlation between the data collected by different sensors, it is difficult for business personnel to repair these errors directly by specifying business rules. In addition, the same kinds of errors may exist in other data sets; this paper calls them domain-independent error data. To address these problems, this paper studies the characteristics of domain-independent error data and proposes a common data cleaning framework to repair it. The main contributions of this paper are summarized as follows:

(1) According to the level of participation of business personnel, the data cleaning process is divided into three modes: supervised, semi-supervised and unsupervised, and their mathematical descriptions are given. Because sufficient domain knowledge is lacking, the repair of domain-independent error data is essentially a non-intervention, unsupervised process.

(2) For domain-independent error data in unsupervised data cleaning, this paper proposes an attribute correlation-based framework under blocking (ACB-Framework) to repair it. The framework adopts the idea of machine learning to learn the attribute correlation in a data set, and selects the 2n+1 closest tuples for repair according to the learned attribute correlation. The experiments show that the framework is effective for domain-independent error data and is a common method that can be applied to different error types.

(3) To reduce the time cost of the framework, this paper proposes three data blocking methods with different clustering accuracy and analyzes their convergence and time complexity. The experimental part also discusses the influence of clustering accuracy on the repair ability of the framework.

In summary, thanks to the blocking methods, the ACB-Framework reduces the time cost, although its repair ability is reduced as well. Because the repair process is unsupervised, the framework can be applied in information systems that require rapid response, and it provides some reference value for data cleaning in other fields.
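
The abstract does not give the details of the ACB-Framework, so the following is only a minimal sketch of the general idea in contribution (2), under stated assumptions. It assumes numeric attributes, uses the absolute Pearson correlation as the "learned attribute correlation", weights a squared distance by that correlation, and repairs a missing cell with the median of the 2n+1 closest complete tuples. The names `pearson` and `repair_cell`, the choice of distance and the use of the median are all hypothetical and are not taken from the thesis. Blocking is not shown.

```python
import math
from statistics import median


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def repair_cell(rows, i, j, n=1):
    """Return a repair value for the missing cell rows[i][j].

    Hypothetical helper (not the thesis's algorithm): attributes that
    correlate strongly with attribute j count more in the distance, and
    the repair is the median of attribute j over the 2n+1 closest tuples.
    """
    # Learn correlations only from tuples without missing values.
    complete = [r for r in rows if None not in r]
    target = rows[i]
    # Compare on the attributes that are present in the target tuple.
    others = [a for a in range(len(target)) if a != j and target[a] is not None]
    col_j = [r[j] for r in complete]
    weights = {a: abs(pearson([r[a] for r in complete], col_j)) for a in others}

    def dist(r):
        # Correlation-weighted squared distance to the target tuple.
        return sum(weights[a] * (r[a] - target[a]) ** 2 for a in others)

    nearest = sorted(complete, key=dist)[: 2 * n + 1]
    return median(r[j] for r in nearest)


# Toy data set: attribute 1 is exactly 10x attribute 0; attribute 2 is noise.
rows = [
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 3.0],
    [3.0, 30.0, 9.0],
    [4.0, 40.0, 1.0],
    [5.0, 50.0, 7.0],
    [3.1, None, 4.0],  # null value to be repaired
]
repaired = repair_cell(rows, i=5, j=1, n=1)
print(repaired)
```

An odd neighbourhood size such as 2n+1 means the median is always one of the observed values, which is one plausible reason for that choice; the thesis may use the selected tuples differently.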

Keywords/Search Tags:

Data quality, Unsupervised data cleaning, Attribute correlation, Data blocking, Machine learning

Related items

1	Research On Multi-source Heterogeneous Large Data Cleaning Technology Based On Machine Learning
2	Application Of Artificial Intelligence On Data Cleaning
3	Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform
4	Based On Spatial-temporal Correlation Sensory Data Cleaning Research
5	Research On Data Cleaning Technology With The Design And Implementation Of Data Cleaning Framework
6	Data Quality Analysis And Optimization In Public Security Intelligence Based On ETL
7	Research Of Data Cleaning Method Based On Data Warehouse
8	Research Of Key Technology In Massive Data Cleaning
9	Key Techniques Of Structured Data Cleaning
10	Design And Implementation Of Data Preprocessing System Oriented To Data Mining