Research On Data Cleaning Based On Clustering

Posted on: 2018-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhao
GTID: 2348330536477545
Subject: Computer application technology
Abstract/Summary:
Most of the data used in data mining comes from the real world. Such data sets suffer from many problems, including missing data, redundant data, and inconsistent data. This problematic data is called "dirty data". Constraints on data collection conditions, errors in measurement methods, omissions in manual input, and violations of data constraints are all reasons why data sets contain a large amount of dirty data. In data sets from some domains, the proportion of dirty data is as high as 50%-60%, or even higher. Dirty data not only carries erroneous information but, more importantly, affects subsequent data mining work, leading to erroneous extracted patterns and biased derived rules. This is the "garbage in, garbage out" problem. How to handle dirty data is therefore particularly important, and this is the task of data cleaning. As a result, data cleaning has become one of the main research topics in data preprocessing and data mining.

This thesis focuses on data cleaning technology in the field of data mining, especially missing data cleaning. The types of data cleaning are analyzed in detail, including abnormal data cleaning, missing data cleaning, and duplicate record cleaning. Among these, missing data is particularly common. The traditional clustering-based missing data filling method still suffers from low filling accuracy and unstable filling efficiency. This thesis therefore studies and improves the clustering-based missing data filling method, and proposes two improvement strategies: distance maximization and clustering of records with missing data. Multiple experiments show that the improved algorithm performs well. The main research work is as follows:

(1) First, the clustering method is improved. The original filling algorithm requires the K value as input, but the K value that yields the optimal clustering result is difficult to determine, which directly reduces the accuracy of data filling. Based on the principle that the records farthest apart are the least likely to belong to the same class, the improved algorithm uses the maximum distance between records to determine the cluster centers, which automatically determines the K value of the clustering result. This lets the clustering reach its best result quickly and makes data filling efficient.

(2) Second, the filling procedure is optimized by merging clustering with the computation of missing data similarity. The traditional clustering method cannot cluster a data set that contains missing data. The improved method modifies the clustering distance function so that distances can be computed for records with missing values. Such records can then be clustered directly, so the missing data is clustered and labeled in a single step. This significantly simplifies the steps of the original filling algorithm.

(3) Finally, filling of discrete data is added to the filling process. If the missing attribute is discrete, the missing value is filled with the most frequent value of that attribute in the record's labeled cluster; if the missing attribute is numeric, it is still filled with the mean of that attribute in the labeled cluster.
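
The three steps above can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several details are assumptions because the abstract does not specify them: missing values are encoded as None, the distance over records with missing values is a rescaled distance over the attributes both records share, centers are picked with the classic max-min distance heuristic using a hypothetical threshold theta as the stopping rule, and records are assigned to centers in a single pass instead of iterating K-means to convergence. The function names are illustrative only.

```python
import math
from collections import Counter

MISSING = None  # assumption: missing attribute values are encoded as None


def partial_distance(a, b):
    """Distance over attributes observed in both records, rescaled to full width.

    Numeric attributes contribute a squared difference, discrete attributes a
    0/1 mismatch. Numeric attributes are assumed to be on comparable scales.
    """
    total, shared = 0.0, 0
    for x, y in zip(a, b):
        if x is MISSING or y is MISSING:
            continue
        shared += 1
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    if shared == 0:
        return math.inf  # nothing to compare on
    return math.sqrt(total * len(a) / shared)


def maxmin_centers(records, theta=0.5):
    """Pick cluster centers by the max-min distance heuristic; K is not an input.

    theta is a hypothetical stopping threshold, not a value from the thesis.
    Only complete records are considered as candidate centers.
    """
    complete = [r for r in records if MISSING not in r]
    centers = [complete[0]]
    second = max(complete, key=lambda r: partial_distance(r, centers[0]))
    base = partial_distance(second, centers[0])
    if base == 0:
        return centers  # all complete records coincide
    centers.append(second)
    while True:
        # Candidate: the record whose nearest center is farthest away.
        cand = max(complete,
                   key=lambda r: min(partial_distance(r, c) for c in centers))
        gap = min(partial_distance(cand, c) for c in centers)
        if gap <= theta * base:
            return centers
        centers.append(cand)


def cluster_and_fill(records, theta=0.5):
    """Label every record (complete or not) with a cluster, then fill gaps."""
    centers = maxmin_centers(records, theta)
    labels = [min(range(len(centers)),
                  key=lambda k: partial_distance(r, centers[k]))
              for r in records]
    filled = []
    for r, label in zip(records, labels):
        row = list(r)
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            observed = [s[j] for s, l in zip(records, labels)
                        if l == label and s[j] is not MISSING]
            if not observed:
                continue  # no observed value in this cluster; leave the gap
            if all(isinstance(o, (int, float)) for o in observed):
                row[j] = sum(observed) / len(observed)            # numeric: mean
            else:
                row[j] = Counter(observed).most_common(1)[0][0]   # discrete: mode
        filled.append(tuple(row))
    return filled, labels
```

Calling cluster_and_fill on a list of equal-length tuples returns the filled records together with their cluster labels. A larger theta yields fewer clusters and a smaller theta yields more, so the choice of K is replaced by the choice of a threshold.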
Keywords/Search Tags:Data cleaning, Missing Data Imputation, K-means Imputation Algorithm, Maximum Distance