| Clustering classifies a data set according to the internal characteristics of the data, without prior knowledge such as class labels, with the aim of finding new structures, properties, and relationships. With the development of information technology, cluster analysis is applied in more and more fields. At the same time, the complexity of data is increasing (the shapes of the subsets in a data set are diverse, and both the sizes and the densities of the classes vary greatly), so many existing clustering algorithms perform poorly on such data sets. The main reasons are as follows: 1) it is difficult to determine the number of clusters; 2) the algorithms are sensitive to parameters; 3) noise points affect the clustering results; 4) many factors must be considered for data with arbitrary shapes, imbalanced cluster sizes, and unbalanced density distributions. To solve these problems, we design a clustering algorithm that automatically obtains the number of clusters, a parameter-free clustering algorithm for imbalanced data, and an adaptive multi-center clustering algorithm for arbitrarily distributed data. The main research results and innovations of this dissertation are as follows: 1. Most existing clustering algorithms need to know the number of clusters in advance. The density peak clustering algorithm [1] offers a new scheme in which users select the cluster centers according to the positions of the data points on a two-dimensional decision graph. Although this algorithm gives some criteria for selecting the cluster centers, it requires users to judge and select the center points from their own experience, and different selections produce different clustering results. Therefore, we propose a density-peak-based clustering algorithm that automatically determines the number of clusters. First, a method for selecting the initial cluster centers is designed. Then, the remaining data points are allocated to obtain 
the initial clusters. Inspired by scale-space theory, the initial cluster centers are merged, and the resulting number of clusters is recorded. This is repeated until everything is merged into one cluster. Finally, the number of clusters that remains unchanged for the most iterations is taken as the final number of clusters, and the corresponding clusters are the final clustering result. The algorithm automatically obtains the number of clusters and eliminates the influence of noise points. Experiments show that the algorithm performs well on both convex and non-convex data sets. 2. When clustering imbalanced data sets, existing clustering algorithms tend to regard small clusters as noise points or to mistakenly allocate data points of large clusters to small clusters. To solve this problem, we design a parameter-free clustering algorithm for imbalanced data based on local density peaks. Because the density peaks clustering algorithm requires a distance threshold to be specified, a new adaptive method for determining the distance threshold is proposed, together with a new local density calculation method for imbalanced data sets. Then, a three-dimensional decision graph is designed to better distinguish noise points from the centers of small clusters, which solves the problem of small clusters being regarded as noise points. On this basis, first, an initial subcluster construction scheme is designed, which automatically generates the initial subclusters. Second, a subcluster updating strategy is proposed, which identifies and removes false subcluster centers. The resulting subclusters prevent data points of large clusters from being incorrectly allocated to small clusters during clustering. Third, a subcluster merging scheme is designed, which automatically merges the updated subclusters to form the final clustering result. Experiments show that, compared with similar algorithms, the algorithm has 
good clustering performance on both imbalanced and balanced data sets, and its time cost is significantly reduced. 3. Because existing algorithms cluster data sets with arbitrary shapes and nonuniform density distributions poorly, we propose a multi-center clustering algorithm based on mutual nearest neighbors. The algorithm uses multiple centers to represent a cluster so that it can effectively cluster arbitrarily distributed data. First, we design a center point discovery scheme based on mutual nearest neighbors, which finds the center points adaptively without any parameters. Because the center points are found from their mutual nearest neighbors, independently of the distances and densities between data points, the algorithm is suitable for data sets with nonuniform density distributions. Then, a subcluster construction scheme based on the connection of center points is designed. The scheme constructs subclusters by connecting multiple center points in adjacent areas to form the maximum connection of the center points. Therefore, the algorithm is effective for clustering non-convex data sets. Finally, we calculate the difficulty of merging subclusters according to the degree of overlap and the distance between them. Based on this merging difficulty, we design an algorithm to determine the number of clusters. The number of clusters at which the merging difficulty changes the most is taken as the final number of clusters, and the corresponding clustering result is the final clustering result. Compared with the comparison algorithms, the algorithm obtains the cluster center points automatically by using mutual nearest neighbors and does not need any parameters. It can effectively cluster arbitrarily distributed data sets. |
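
For readers unfamiliar with the two-dimensional decision graph of the density peak clustering algorithm [1] mentioned above, the following is a minimal sketch of how its two axes are commonly computed. It is not the method proposed in this dissertation. It assumes the standard cutoff-kernel formulation: the local density `rho` of a point is the number of other points closer than a distance threshold, and `delta` is the distance to the nearest point of higher density. The function name `decision_graph`, the parameter name `d_c`, and the toy data are illustrative assumptions.

```python
import numpy as np


def decision_graph(points, d_c):
    """Return (rho, delta), the two axes of a density-peak decision graph.

    Assumption: cutoff-kernel density with a user-supplied threshold d_c.
    """
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Local density: number of *other* points closer than d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    n = len(points)
    delta = np.empty(n)
    for i in range(n):
        higher = rho > rho[i]
        if higher.any():
            # Distance to the nearest point with strictly higher density.
            delta[i] = dist[i, higher].min()
        else:
            # No denser point exists (ties included): use the largest distance.
            delta[i] = dist[i].max()
    return rho, delta


# Hypothetical toy data: two small groups, each with one central point.
pts = [[0, 0], [0, 1], [1, 0], [0.3, 0.3],
       [10, 10], [10, 11], [11, 10], [10.3, 10.3]]
rho, delta = decision_graph(pts, d_c=0.9)
# Points with both large rho and large delta are candidate cluster centers.
candidates = np.argsort(rho * delta)[-2:]
```

In the original scheme the user reads such candidates off the plot of `rho` against `delta` by eye; this manual step, together with the choice of `d_c`, is what the algorithms summarized above aim to remove.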