Research On Data Cleaning Based On Clustering Algorithm

Posted on:2020-09-28

Degree:Master

Type:Thesis

Country:China

Candidate:X K Feng

GTID:2428330590479152

Subject:Full-time Engineering

Abstract/Summary:

With the rapid development of information technology, the volume of data is growing explosively, and data mining has emerged in response. Data mining extracts knowledge from data, so the quality of that data is critical. Manual-entry mistakes and network errors introduce abnormal attribute values, duplicate records, and missing values, all of which undermine data quality to some degree, so the reliability of the resulting information cannot be guaranteed. Preprocessing is therefore important before data mining, and data cleaning is its critical step. This thesis focuses on data cleaning, especially the handling of missing values. Traditional data cleaning covers removing duplicate records, handling abnormal data, and handling missing values. Clustering is an important technique in data cleaning, but cluster-based filling of missing values still suffers from instability and low filling accuracy. To address these problems, this thesis improves the clustering algorithm, uses the triangle inequality to improve the efficiency of the filling algorithm, and gives a method for filling discrete missing values. The results show that the improved clustering algorithm not only fills missing values but also runs more efficiently. The main work is as follows.

(1) The traditional missing-data filling method based on DBSCAN clusters with a fixed neighborhood radius, and its filling results are poor on data sets of non-uniform density. To address this shortcoming, this thesis improves the DBSCAN algorithm. The main idea of the improved algorithm is to search for core objects with a variable neighborhood instead of a fixed one, and to find the clusters in the data set using the theory of strongly connected components of a graph. The improved DBSCAN algorithm adaptively adjusts the neighborhood size according to the density around each data object, which both filters out noise points and yields higher filling accuracy on data sets of non-uniform density.

(2) The triangle inequality property of Euclidean distance is used to reduce the number of distance computations and comparisons when calculating the similarity between a record with missing values and the data set. Especially when the data set is very large, this can greatly improve the efficiency of the algorithm.
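
The pruning idea in (2) can be illustrated with a small sketch. It is not the thesis's implementation: the function names, the single pivot point, and the nearest-donor filling rule are all assumptions made for illustration. It relies only on the fact that, for Euclidean distance, d(q, x) >= |d(q, p) - d(p, x)| for any pivot p, so a record whose lower bound already exceeds the best distance found so far can be skipped without computing its full distance.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_with_pruning(query, records, pivot):
    """Return (index, distance, n_computed) for the record nearest to query.

    Hypothetical helper: uses the bound d(q, x) >= |d(q, p) - d(p, x)|
    to avoid computing distances that cannot beat the current best.
    """
    # Pivot-to-record distances; in practice these would be precomputed once.
    pivot_dists = [euclidean(pivot, r) for r in records]
    d_qp = euclidean(query, pivot)

    # Visit records by increasing lower bound so pruning can stop early.
    order = sorted(range(len(records)),
                   key=lambda i: abs(d_qp - pivot_dists[i]))

    best_idx, best_dist, computed = -1, math.inf, 0
    for i in order:
        lower_bound = abs(d_qp - pivot_dists[i])
        if lower_bound >= best_dist:
            break  # all remaining records have an even larger lower bound
        d = euclidean(query, records[i])
        computed += 1
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, computed


def fill_missing(record, complete_records, pivot):
    """Fill None entries of record from its nearest complete record.

    Assumption: similarity is measured only on the attributes that are
    observed in the incomplete record.
    """
    observed = [j for j, v in enumerate(record) if v is not None]

    def project(row):
        return [row[j] for j in observed]

    idx, _, _ = nearest_with_pruning(
        project(record),
        [project(r) for r in complete_records],
        project(pivot),
    )
    donor = complete_records[idx]
    return [donor[j] if v is None else v for j, v in enumerate(record)]
```

For example, with complete records `[[0, 0, 1], [10, 10, 2], [1, 1, 3], [5, 5, 4]]` and the first record as pivot, `fill_missing([0.9, 1.2, None], records, records[0])` returns `[0.9, 1.2, 3]` after computing only one full distance. The saving grows with the number of records, which matches the abstract's remark about very large data sets.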

Keywords/Search Tags:

Missing value padding, Non-uniform density clustering, DBSCAN, Strong connected component, Triangle inequality

Related items

1	Research On Adaptive Varied Density Clustering Algorithm Based On DBSCAN
2	An Improved Semi Supervised Clustering Of Given Density And Its Application In Lithology Identification
3	Research On Clustering Method Based On Improved DBSCAN
4	Research On Density Clustering Algorithm Based On DBSCAN For Personalized Clustering
5	Research And Application Of Clustering Algorithm Based On DBSCAN
6	Research And Application On Distributed Clustering And Incremental Clustering Based On DBSCAN
7	Theory And Practice Of Ant Clustering And Partitioning-based DBSCAN Clustering
8	Research On Adaptive Clustering Algorithm Based On DBSCAN Theory
9	Research On DBSCAN Algorithm Based On Grid And Density-ratio
10	Improvement Of Spectral Clustering Algorithm Based On Local Principal Component Analysis And Self-paced Learning