| With the rapid development of information technology, the volume of information has exploded and huge amounts of data have been produced. Data mining technology emerged to extract valuable information from such massive data. Cluster analysis is an important task in data mining and is widely applied across many fields. Although cluster analysis has made considerable progress in recent years, how to combine the strengths of different clustering ideas into better algorithms remains an active research topic. Clustering by fast search and find of density peaks is a clustering algorithm published in Science in 2014 that combines density-based and partition-based ideas. The algorithm is both novel and effective at clustering. Based on an in-depth study and analysis of this algorithm, this thesis proposes a potential-based clustering method with hierarchical optimization, which retains the algorithm's advantages and addresses its deficiencies. The traditional density model is sensitive to the neighborhood radius and considers only local data objects in its calculation, which often yields mediocre results. In its first stage, the improved algorithm introduces a potential field model that describes the data objects accurately by using the overall distribution information of the dataset. An edge-weighted tree built on potential energy then optimizes the original allocation strategy. In addition, the distribution characteristics of the dataset are fully considered when calculating the decision value, so the weights of the parameters are determined automatically from their degree of dispersion. On this basis, drawing on the idea of the normal distribution, a proactive strategy is adopted: all data points whose decision value exceeds the upper limit of the confidence interval are selected as potential cluster centers, producing multiple initial sub-clusters. The original algorithm is limited by its clustering principle, so it often has difficulty identifying sparse clusters and uniformly distributed clusters. In the second stage of the improved algorithm, inspired by hierarchical clustering, this thesis proposes a series of cluster merging criteria based on potential energy. The initial sub-clusters generated in the first stage are merged step by step by comparing each cluster's average potential energy with its border potential energy, which gives the final clustering result. Through this hierarchical optimization, the algorithm stops the clustering process automatically without requiring the number of clusters in advance, and it recognizes clusters of arbitrary shape, distribution, size and density well. Experiments on two-dimensional and multi-dimensional datasets show that, compared with other algorithms, the improved algorithm achieves significantly higher clustering quality, higher stability and stronger cluster recognition ability, with some improvement in the handling of high-dimensional datasets, while the overall clustering time does not grow appreciably. These results demonstrate the superiority of the proposed algorithm. In addition, the improved algorithm has been applied successfully to reader segmentation for publishing and media enterprises, which also reflects its effectiveness in solving practical problems. |
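
The first-stage idea of selecting candidate centers can be illustrated with a minimal sketch. It is not the thesis's implementation, and several details are assumptions because the abstract does not give the formulas: the potential is taken to be a Gaussian-style sum over all points with a hypothetical width `sigma`, the decision value is the plain product of potential and separation distance (the thesis instead weights the two terms by their dispersion), and the upper limit of the confidence interval is taken as the mean plus `z` standard deviations. The names `potentials`, `separation` and `candidate_centers` are illustrative only.

```python
import math
import statistics


def pairwise_distances(points):
    """Euclidean distance matrix as a list of lists."""
    return [[math.dist(p, q) for q in points] for p in points]


def potentials(dist, sigma):
    """Assumed Gaussian-style potential: every point contributes, not only neighbours."""
    return [sum(math.exp(-(d / sigma) ** 2) for d in row) for row in dist]


def separation(dist, phi):
    """Distance from each point to the nearest point with strictly higher potential."""
    n = len(phi)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if phi[j] > phi[i]]
        # A point with no higher-potential point gets its farthest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return delta


def candidate_centers(points, sigma=1.0, z=1.96):
    """Return indices whose decision value exceeds mean + z * std (assumed upper limit)."""
    dist = pairwise_distances(points)
    phi = potentials(dist, sigma)
    delta = separation(dist, phi)
    # Unweighted product, as in the original density peaks algorithm (an assumption here).
    gamma = [p * d for p, d in zip(phi, delta)]
    upper = statistics.mean(gamma) + z * statistics.pstdev(gamma)
    return [i for i, g in enumerate(gamma) if g > upper]
```

On two well-separated 3×3 grids of points, this sketch flags the midpoint of each grid as a candidate center. The edge-weighted tree allocation and the second-stage merging criteria are not covered here.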