With the continuous development of science and technology, the volume of data generated in daily life has grown exponentially, and analyzing these data effectively to mine hidden information has become very important. Cluster analysis, a form of unsupervised machine learning, is a common data mining tool. Its purpose is to divide data points into clusters of similar objects, so that points in the same cluster are highly similar and points in different clusters differ greatly. Clustering is an active research field, and researchers at home and abroad have proposed a variety of clustering algorithms for different situations. Density-based clustering methods can handle arbitrarily shaped clusters, discover the number of clusters automatically, cope with noise points and outliers, and partition data with only a small amount of domain knowledge. They have therefore attracted increasing attention and are widely used in data mining and pattern recognition. This paper studies density-based clustering algorithms, which suffer from error propagation during label propagation. This paper improves such algorithms and applies them to text clustering. The main research contents are as follows:

(1) The Dynamic graph-based label propagation for density peaks clustering algorithm requires manual parameter setting, amplifies errors during label propagation, and produces worse clustering results as the number of iterations increases. To address these problems, a density peak clustering algorithm with nearest-neighbor label propagation is proposed. The proposed algorithm propagates labels through nearest neighbors and fully considers the structural information between data points. At the same time, the algorithm constantly updates the data state during the label assignment stage, so that more information is captured to
improve the assignment accuracy. In addition, no iteration is required. Even if some points are mislabeled, the clustering results are not affected, which improves the robustness of the algorithm. The experimental results show that the proposed algorithm has good performance and robustness. It achieves good results on local and nonlinear clustering problems and can handle complex data such as manifold datasets.

(2) The Local gap density for clustering high dimensional data with varying densities algorithm assigns points in a chain, which causes a "chain error" problem. To address this problem, a local gap density clustering algorithm with weighted nearest-neighbor assignment is proposed. The algorithm first adopts an approach similar to semi-supervised learning, using the clustering information already obtained to calculate assignment probabilities. It then assigns each unassigned point to its most probable cluster. To keep the assignment probabilities valid, the data state is updated continuously during assignment, and the correlation between data points is fully considered to avoid "chain error". Only one sample is labeled at a time, and the information about the already assigned clusters among its nearest neighbors is fully used. Even if some points are mislabeled, the clustering results are not affected, which improves the robustness of the algorithm. Experimental results show that the algorithm performs well and is robust on both artificial and real datasets, and can handle complex data such as manifold and nonlinear datasets.

(3) The two algorithms proposed in this paper are applied to text data to verify their effectiveness in text clustering. First, a text dataset is obtained from the web, and its format and contents are lightly adjusted. Next, the Python-based "Jieba" word segmentation tool is used to segment the text and filter stopwords. Finally, the corpus is converted into text vectors
obtained directly from term frequency-inverse document frequency (TF-IDF) weighting and text vectors obtained after PCA dimensionality reduction. Both clustering algorithms are run on these text vectors, and the clustering results are saved. The experimental results show that both algorithms can be used for text clustering. The local gap density clustering algorithm with weighted nearest-neighbor assignment is better suited to non-sparse data, and the density peak clustering algorithm with nearest-neighbor label propagation is better suited to sparse data.
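
The assignment idea shared by contributions (1) and (2), labeling one point at a time from its already labeled nearest neighbors and updating the state after every assignment, can be illustrated with a minimal sketch. This is not the algorithm proposed in this paper: the function names, the inverse-distance weight 1/(1+d), the normalized support score, and the toy data are all assumptions made for illustration, and the seed labels stand in for cluster cores found by an earlier density step that is not shown.

```python
import math


def knn(points, i, k):
    """Return (index, distance) pairs for the k nearest neighbours of point i."""
    dists = sorted(
        (math.dist(points[i], p), j) for j, p in enumerate(points) if j != i
    )
    return [(j, d) for d, j in dists[:k]]


def assign_by_weighted_neighbors(points, seeds, k=3):
    """Label points one at a time from their labelled nearest neighbours.

    seeds maps point index -> cluster label for points that are already
    labelled (hypothetical stand-in for cluster cores). The weighting
    scheme is an illustrative assumption, not the paper's formula.
    """
    labels = dict(seeds)
    neighbors = {i: knn(points, i, k) for i in range(len(points))}
    while len(labels) < len(points):
        best = None  # (support, point index, label)
        for i in range(len(points)):
            if i in labels:
                continue
            votes = {}
            total = 0.0
            for j, d in neighbors[i]:
                w = 1.0 / (1.0 + d)  # closer neighbours count more
                total += w
                if j in labels:
                    votes[labels[j]] = votes.get(labels[j], 0.0) + w
            if not votes:
                continue  # no labelled neighbour yet; revisit later
            label, weight = max(votes.items(), key=lambda kv: kv[1])
            support = weight / total  # share of neighbour weight, in (0, 1]
            if best is None or support > best[0]:
                best = (support, i, label)
        if best is None:
            break  # remaining points have no labelled neighbours
        _, i, label = best
        labels[i] = label  # assign exactly one point, then recompute support
    return labels


# Toy data: two well-separated groups with one labelled seed each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
seeds = {0: "A", 4: "B"}
print(assign_by_weighted_neighbors(points, seeds, k=3))
```

Because only the best-supported point is labeled in each pass and the support scores are recomputed afterwards, a doubtful point is deferred until more of its neighbors carry labels, which is the intuition behind avoiding chain-style error propagation.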