| With the advent of the 21st century, Internet applications have become increasingly popular, generating large amounts of information. In many research fields this information plays a very important role in research progress, and valuable information must be extracted from massive amounts of data, which requires data mining technology. This article focuses on clustering algorithms in data mining. As an unsupervised learning method, a clustering algorithm finds the cluster structure in a data set by grouping data points with similar characteristics, so that similarity within a cluster is maximized and the difference between clusters is maximized; each cluster represents a distinct feature or similarity among the data points. Clustering is a basic data analysis tool, so it has a wide range of applications in different scientific fields, and it is especially important in unsupervised learning scenarios. Clustering algorithms can be divided into hierarchical, partition-based, density-based, and other methods. In practical applications, partition-based clustering algorithms such as K-means, K-means++, and X-means are the most widely studied and applied. Although many improved partition-based clustering algorithms exist, they still face the following problems: (1) determining the number of clusters; (2) selecting the initial cluster centers; (3) a poor ability to search for optimal parameters. In response to these problems, and building on a study of clustering methods, performance evaluation measures, and related background knowledge, this paper proposes two clustering algorithms: (1) a completely unsupervised K-means based on weighted entropy and (2) boundary-stripping clustering. The main research work on these two clustering algorithms is as follows: (1) The k-means algorithm
is an unsupervised clustering algorithm, but it always depends on the number of clusters being specified in advance at initialization. To improve clustering accuracy and determine the number of clusters, this paper proposes a K-means algorithm based on entropy theory (EK-means). The algorithm constructs an information entropy for each data object as the information of that data point, and then combines it with the membership degree to construct a new objective function. Based on the new objective function, an unsupervised learning mode can be constructed for the k-means algorithm. In this learning mode, the k-means algorithm does not need the cluster initialization to be set in advance, while also finding an optimal number of clusters. The time complexity of the EK-means algorithm is also analyzed. Finally, the proposed EK-means method is compared with other existing clustering algorithms, and the experiments demonstrate the effectiveness of the EK-means clustering algorithm proposed in this paper. (2) This paper proposes Boundary-stripping (BS), a new non-parametric clustering method based on the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. The method is based on the following concept: each potential cluster consists of layers surrounding its core, where the outer layers, or boundary points, implicitly separate the clusters. Unlike DBSCAN, in which the cores of the clusters are defined directly by density, BS reveals the undiscovered core points by gradually peeling off boundary points. Analyzing the density of local neighborhoods identifies boundary points and associates them with inner points. Experiments show that the BS algorithm adapts to local density and features and can successfully separate adjacent clusters, possibly of different densities. The algorithm was tested on a large number of labeled data sets, which included high-dimensional data
with deep features learned by a convolutional neural network. Experiments show that, when using a fixed parameter set, the method proposed in this paper is more competitive than other recent non-parametric methods. |
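
The layer-peeling idea described for the BS method can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the algorithm proposed in this paper: the function name `peel_and_cluster` and the parameters `k`, `n_iters`, `peel_frac`, and `link_radius` are hypothetical, local density is approximated by the mean distance to the `k` nearest remaining neighbours, and the surviving core points are grouped by a plain distance threshold.

```python
import numpy as np


def peel_and_cluster(X, k=5, n_iters=3, peel_frac=0.2, link_radius=1.0):
    """Toy boundary peeling: strip low-density layers, cluster the cores,
    then give each peeled point the label of the inner point it was tied to.
    All parameters are hypothetical and chosen for illustration only."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    active = np.ones(n, dtype=bool)   # points not yet peeled
    parent = np.arange(n)             # inner point each peeled point is tied to

    for _ in range(n_iters):
        idx = np.flatnonzero(active)
        n_peel = int(peel_frac * len(idx))
        if len(idx) <= k + 1 or n_peel == 0:
            break
        sub = dists[np.ix_(idx, idx)]
        # Mean distance to the k nearest active neighbours (column 0 is self).
        knn_mean = np.sort(sub, axis=1)[:, 1:k + 1].mean(axis=1)
        # The sparsest points form the current boundary layer.
        peeled = idx[np.argsort(knn_mean)[-n_peel:]]
        active[peeled] = False
        inner = np.flatnonzero(active)
        # Associate every peeled point with its nearest remaining inner point.
        nearest = np.argmin(dists[np.ix_(peeled, inner)], axis=1)
        parent[peeled] = inner[nearest]

    # Group the remaining core points: connected components under link_radius.
    labels = np.full(n, -1)
    core = np.flatnonzero(active)
    next_label = 0
    for start in core:
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            p = stack.pop()
            near = core[(dists[p, core] <= link_radius) & (labels[core] == -1)]
            labels[near] = next_label
            stack.extend(near.tolist())
        next_label += 1

    # Peeled points inherit a label by following their chain of associations.
    for i in np.flatnonzero(~active):
        j = i
        while not active[j]:
            j = parent[j]
        labels[i] = labels[j]
    return labels
```

Calling `peel_and_cluster(X)` on an `(n, d)` NumPy array returns one integer label per row. Because the sketch relies on a fixed peel fraction and a fixed linking radius, it does not adapt to local density the way the abstract describes for BS; it only shows how peeling boundary layers and linking them back to inner points can yield a clustering.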