
Research On Data Cleaning Based On Clustering Algorithm

Posted on: 2020-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: X K Feng
Full Text: PDF
GTID: 2428330590479152
Subject: Full-time Engineering
Abstract/Summary:
With the rapid development of information technology, data volumes are growing explosively and data mining has emerged in response. Data mining extracts knowledge from data, so the quality of that data is critical. Because of the limitations of manual work and network errors, data often contain abnormal attribute values, duplicate records, and missing values, all of which undermine data quality to some degree, so the reliability of the information cannot be guaranteed. Preprocessing is therefore important before data mining, and data cleaning is its critical step. This thesis focuses on data cleaning, in particular the cleaning of missing values. Traditional data cleaning covers the cleaning of duplicate records and the handling of abnormal data and missing values. Clustering is an important technique in data cleaning, but cluster-based filling of missing values still has drawbacks such as instability and low filling accuracy. To address these problems, the clustering algorithm is improved, the efficiency of the filling algorithm is raised by exploiting the triangle inequality, and a method for filling discrete missing values is given. The study shows that the improved clustering algorithm not only fills missing values but also runs more efficiently. The main work is as follows.
(1) The traditional missing-data filling method based on DBSCAN clusters with a fixed neighborhood radius, and its filling results are poor on data sets of non-uniform density. In response to this shortcoming, this thesis improves the DBSCAN algorithm. The main idea of the improved algorithm is to search for core objects with a variable neighborhood instead of a fixed one, and to find the clusters in the data set by means of the theory of strongly connected components in graphs. The improved DBSCAN algorithm adaptively adjusts the neighborhood size according to the density around each data object, which not only filters out noise points but also gives higher filling accuracy on data sets of non-uniform density.
(2) The triangle inequality property of the Euclidean distance is used to reduce the number of distance computations and comparisons when calculating the similarity between records with missing values and the data set. Especially when the data set is very large, the efficiency of the algorithm can be greatly improved.
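The distance-pruning idea in point (2) can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual scheme, so everything here is an assumption made for illustration: a single reference point (pivot), the hypothetical function names euclidean and nearest_with_pruning, and the toy records. The sketch relies only on the fact that, for Euclidean distance, d(q, x) >= |d(q, p) - d(x, p)|, so a record whose lower bound already reaches the best distance found so far can be skipped without computing its full distance.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_with_pruning(query, records, pivot, pivot_dists):
    """Return (index of nearest record, number of full distances computed).

    pivot_dists[i] must hold euclidean(records[i], pivot); it is computed
    once and reused for every query (hypothetical design, not the thesis's).
    """
    d_query_pivot = euclidean(query, pivot)
    best_idx, best_dist, computed = -1, math.inf, 0
    for i, rec in enumerate(records):
        # Triangle inequality: d(query, rec) >= |d(query, pivot) - d(rec, pivot)|
        lower_bound = abs(d_query_pivot - pivot_dists[i])
        if lower_bound >= best_dist:
            continue  # rec cannot beat the current best, skip the full distance
        d = euclidean(query, rec)
        computed += 1
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, computed

# Toy data: two records near the query and three far away (illustrative only).
records = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (9.0, 7.5), (8.5, 9.0)]
pivot = (0.0, 0.0)
pivot_dists = [euclidean(r, pivot) for r in records]

idx, computed = nearest_with_pruning((1.1, 1.0), records, pivot, pivot_dists)
print(idx, computed)  # nearest record index and how many full distances were needed
```

In this toy run the three distant records are rejected by the lower bound alone. In a filling setting the distances would presumably be taken over the attributes that are observed in the incomplete record, and the one-off cost of computing pivot_dists is only worthwhile when it is shared across many queries.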
Keywords/Search Tags: Missing value filling, Non-uniform density clustering, DBSCAN, Strongly connected component, Triangle inequality