Research On A Common Method For The Unsupervised Data Cleaning

Posted on: 2020-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: P Li
GTID: 2518306548992909
Subject: Management Science and Engineering
Abstract/Summary:
Sensors are widely used in many fields to monitor the operating status of equipment, and their prevalence has become one of the hallmarks of the Internet of Things society. When sensors collect or store data, equipment failure, electromagnetic interference, environmental change and other causes inevitably introduce data quality problems into the collected data, such as null values, garbled codes, and other erroneous values that violate attribute value constraints. Because it is difficult to know the real state of the equipment at the moment a fault occurs, and because the data collected by different sensors generally show no significant correlation, it is difficult for business personnel to repair these errors directly by specifying business rules. Such errors can also occur in data sets from other domains, so this paper calls them domain-independent error data. To address these problems, this paper studies the characteristics of domain-independent error data and proposes a common data cleaning framework to repair them. The main contributions of this paper are summarized as follows:

(1) According to the level of participation of business personnel, the data cleaning process is divided into three modes: supervised, semi-supervised and unsupervised, and a mathematical description of each is given. Because sufficient domain knowledge is lacking, repairing domain-independent error data in a data set is essentially an unsupervised process without human intervention.

(2) For domain-independent error data in unsupervised data cleaning, this paper proposes an attribute correlation-based framework under blocking (ACB-Framework) to repair them. The framework uses machine learning to learn the attribute correlations in a data set, and then selects the 2n+1 closest tuples, according to the learned attribute correlations, to repair an erroneous value. The experiments show that the framework is effective for domain-independent error data and that it is a common method which can be applied to different error types.

(3) To reduce the time cost of the framework, this paper proposes three data blocking methods with different clustering accuracy and analyzes their convergence and time complexity. The experimental part also discusses the influence of clustering accuracy on the repair ability of the framework.

In summary, the blocking methods reduce the time cost of the ACB-Framework, at the price of some loss in its repair ability. Because the repair process is unsupervised, the framework can be applied in information systems that require rapid response, and it offers some reference value for data cleaning in other fields.
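The abstract does not spell out how the framework measures attribute correlation, how it ranks tuples, or how it combines the 2n+1 selected tuples into a repaired value. The sketch below is therefore only an illustration of the general idea, not the thesis's algorithm. It assumes numeric attributes, absolute Pearson correlation as the learned attribute weight, a weighted absolute-difference distance, and the median of the neighbours as the repair value; the function names `pearson` and `repair_cell` are hypothetical. The blocking step is omitted, so every complete tuple is treated as a candidate.

```python
from statistics import median


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def repair_cell(rows, target_row, target_col, n=1):
    """Suggest a value for a missing cell from the 2n+1 closest complete tuples.

    Assumptions (not taken from the thesis): attribute weights are the absolute
    Pearson correlations with the target attribute, and the repair value is the
    median of the neighbours' target values.
    """
    # Tuples with no missing value serve as the reference data.
    complete = [r for r in rows if all(v is not None for v in r)]
    other_cols = [c for c in range(len(rows[0])) if c != target_col]

    # "Learn" how strongly each other attribute is tied to the target attribute.
    target_values = [r[target_col] for r in complete]
    weights = {
        c: abs(pearson([r[c] for r in complete], target_values))
        for c in other_cols
    }

    dirty = rows[target_row]

    def distance(r):
        # Correlation-weighted distance over the attributes the dirty tuple still has.
        return sum(
            weights[c] * abs(r[c] - dirty[c])
            for c in other_cols
            if dirty[c] is not None
        )

    nearest = sorted(complete, key=distance)[: 2 * n + 1]
    return median(r[target_col] for r in nearest)


rows = [
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 3.0],
    [3.0, 30.0, 9.0],
    [4.0, 40.0, 1.0],
    [5.0, 50.0, 7.0],
    [3.1, None, 4.0],  # the null value to repair
]
print(repair_cell(rows, target_row=5, target_col=1, n=1))  # prints 30.0
```

In this toy data the first attribute is perfectly correlated with the target and the third only weakly, so the three nearest tuples are chosen almost entirely by the first attribute. Using an odd number of neighbours, 2n+1, means a median or a majority vote always has a well-defined middle element.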
Keywords/Search Tags: Data quality, Unsupervised data cleaning, Attribute correlation, Data blocking, Machine learning