Research And Application On Density Peaks Clustering Algorithm Based On Natural Neighbor

Posted on: 2024-05-28
Degree: Master
Type: Thesis
Country: China
Candidate: D W Peng
GTID: 2568307139458554
Subject: Computer technology

Abstract/Summary:
Density Peaks Clustering (DPC) is a clustering algorithm proposed in recent years that clusters data quickly and identifies cluster centers from a decision graph. The algorithm assumes that each cluster center has the highest density within its cluster and that cluster centers are far apart from one another. DPC is efficient, and its procedure is simple and straightforward. It requires no prior knowledge and can identify non-spherical clusters regardless of the data distribution, so it is suitable for a variety of fields. However, DPC still has room for improvement in several respects: the cutoff distance parameter affects its clustering performance to a certain extent; cluster centers must be selected manually from the decision graph, which introduces a degree of uncertainty; and DPC performs poorly on high-dimensional datasets. This thesis proposes two improved algorithms, one of which is applied to news text clustering. The main research content of this thesis is as follows:

To address the shortcomings that DPC is sensitive to parameter settings and cannot select cluster centers automatically, an adaptive density peaks clustering algorithm based on natural neighbors (ADPC-NaN) is proposed. First, the algorithm obtains the local neighborhood of each data point through the natural neighbor algorithm, so that the local density of each point can be computed without any parameters and clusters in sparse areas can be identified effectively. Next, the number of cluster centers is determined automatically by the density-weighted Canopy algorithm, candidate cluster centers are selected on the γ-descending graph, and these candidate centers guide the division into sub-clusters. Finally, the sub-clusters are merged using a sub-cluster merging strategy. Experiments show that, compared with other clustering algorithms, ADPC-NaN achieves better results in evaluation indices such as ACC, ARI and AMI on complex datasets of different scales.

An adaptive manifold-based density peaks clustering algorithm based on natural neighbors (AMDPC-NaN) is proposed to address the poor performance of ADPC-NaN on high-dimensional data. Isometric mapping is introduced to map high-dimensional datasets into a lower dimension, yielding a low-dimensional manifold representation that better reflects the global structure of the data. AMDPC-NaN then performs cluster analysis on this representation: it uses the geodesic distance as the distance measure between sample points, defines the local density based on reverse k-nearest neighbors and geodesic distance, and considers both the global and the local structure of the points, which makes it easier to find cluster centers in manifold clusters. The similarity between samples is redefined based on their neighborhood information in order to avoid the cascading effect of incorrectly assigned manifold sample points. Experiments show that, compared with existing improved DPC algorithms based on manifold learning, AMDPC-NaN achieves better clustering performance in three evaluation indices such as ACC on high-dimensional datasets.

The above two improved algorithms are applied to news text clustering. First, the text data are cleaned to ensure completeness and accuracy. Then, the "Jieba" word segmentation tool is used to perform accurate word segmentation and stop-word filtering on the Chinese text, removing noise from the data. Next, the TF-IDF method in the Vector Space Model (VSM) is used to represent the text, producing high-dimensional sparse text vectors. Principal Component Analysis (PCA) is then applied to reduce the dimensionality of the text vectors. Finally, text clustering is carried out with the ADPC-NaN algorithm, and cluster labels are assigned according to its results. Experiments show that ADPC-NaN performs better than other text clustering algorithms in four evaluation indices such as F-measure.
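
For orientation, the baseline DPC quantities that the abstract builds on can be sketched in a few lines: the local density of each point, its distance to the nearest denser point, and their product γ, which ranks candidate cluster centers. This is a minimal pure-Python illustration of the original DPC with a cutoff kernel, not the ADPC-NaN or AMDPC-NaN algorithms of the thesis. The function names, the tie-breaking rule for equal densities and the toy data are assumptions made for this example only.

```python
import math


def dpc_quantities(points, d_c):
    """Return (rho, delta, gamma) for baseline DPC with a cutoff kernel.

    rho[i]   : number of other points closer than the cutoff distance d_c
    delta[i] : distance to the nearest point that ranks as denser than i
               (for the densest point: its largest distance to any point)
    gamma[i] : rho[i] * delta[i], large for likely cluster centers
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
        for i in range(n)
    ]

    # Assumption: ties in density are broken by index so that
    # "denser than" becomes a strict ordering of the points.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
        else:
            delta[i] = min(dist[i][j] for j in order[:rank])

    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma


def pick_centers(gamma, k):
    """Indices of the k points with the largest gamma (hypothetical helper)."""
    return sorted(range(len(gamma)), key=lambda i: -gamma[i])[:k]


# Toy data: two small square blobs, each with one point in the middle.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
]
rho, delta, gamma = dpc_quantities(points, d_c=1.0)
centers = pick_centers(gamma, k=2)
print(centers)  # the two middle points, indices 4 and 9
```

Note that both `d_c` and `k` are supplied by hand here. These are the two manual choices the thesis aims to remove, by deriving density from natural neighbors and by determining the number of centers automatically.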
Keywords/Search Tags: density peak clustering, natural neighbor, isometric mapping, geodesic distance, text clustering