| Data provide us with a large amount of information, but their sheer volume also makes it difficult to extract effective features accurately. Most researchers are eager to mine the deeper value of massive data and make better use of it. Cluster analysis, which mines hidden information and knowledge by processing large amounts of data, has become one of the important research directions in data mining. However, although traditional clustering algorithms are simple and feasible, they struggle to meet practical demands on large-scale data. Therefore, as the amount of data increases dramatically, choosing an appropriate data preprocessing method has become an important research topic in data processing. Data preprocessing algorithms can normalize different types of data. This paper focuses on a Canopy-based data preprocessing clustering algorithm and an incomplete-data preprocessing clustering algorithm. Firstly, a new spectral clustering algorithm based on variance Canopy is proposed to address the poor convergence of traditional clustering algorithms in non-convex spherical sample spaces and the randomness of threshold setting in the Canopy clustering algorithm. First, the data set obtained in the first stage of the spectral clustering algorithm is denoised, and the sample object with the minimum variance is selected as the first Canopy set center. Next, the remaining sample objects are partitioned by calculating the mean distance, iterating until all sample objects in the data set have been assigned, and the selection of cluster centers is further optimized by calculating the cluster standard deviation. Finally, the resulting cluster centers and number of clusters are passed to the second stage of the spectral clustering algorithm to obtain the clustering results. Comparative experiments on UCI data sets verify that 
the algorithm achieves a better clustering effect on non-convex spherical sample spaces. Secondly, to address the incomplete data encountered in practice while also accounting for the assignment of boundary sample objects, a three-way Canopy clustering algorithm for incomplete data is proposed. First, the incomplete data are filled using k-nearest-neighbor imputation: each missing attribute value is replaced by the average of the corresponding attribute values of its neighbors, yielding a complete data set. Next, the filled data set is clustered with the variance Canopy clustering algorithm to obtain the cluster centers and the number of clusters. Finally, the three-way clustering method is introduced: the dispersion and subordinate factor are calculated from the cluster distribution, and sample objects on the boundary of a cluster are assigned to the edge domain, yielding the sample objects of the core domain and the edge domain. The experimental results show that the algorithm achieves good clustering performance and high accuracy on incomplete data. |
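
The k-nearest-neighbor imputation step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `knn_mean_impute`, the default `k`, the use of `None` to mark missing values, and the choice of distance (root mean squared difference over the attributes both samples have observed) are all assumptions made for illustration.

```python
import math


def knn_mean_impute(rows, k=2):
    """Fill None entries with the mean of that attribute over the k nearest rows.

    Assumption (not from the paper): distance between two rows is the root
    mean squared difference over the attributes observed in both rows.
    """
    filled = [list(r) for r in rows]  # copy so the input is left unchanged
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbor must be a different row with attribute j observed.
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # no usable neighbor: leave the value missing
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            # Replace the missing value with the average of the neighbors' values.
            filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Hypothetical toy data: the second sample is missing its third attribute.
sample = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
]
filled = knn_mean_impute(sample, k=2)
```

In this toy data the two nearest neighbors of the incomplete sample are the first and third rows, so the missing value is filled with the average of their third attributes, while the distant fourth row is ignored. Distances are always computed on the original (unfilled) data, so the result does not depend on the order in which rows are processed.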