Manifold Density Peak Clustering Algorithm And Its Application Of Weibo Text Classification

Posted on: 2019-11-20 | Degree: Master | Type: Thesis
Country: China | Candidate: Q F Zhu | Full Text: PDF
GTID: 2428330548481386 | Subject: Computer Science and Technology
Abstract/Summary:
Cluster analysis, a basic and important method in data mining, has attracted wide attention. With the development of data mining technology and the emergence of big data, cluster analysis has also developed rapidly. In 2014, Alex Rodriguez et al. proposed a new density-based clustering algorithm in Science called "Clustering by fast search and find of density peaks", abbreviated as Density Peaks Clustering (DPC). The DPC algorithm is novel, concise, and efficient, and has the following main advantages: the final number of clusters need not be specified in advance; there is only one input parameter, and the algorithm is insensitive to it; it can identify non-spherical clusters; the assignment of data points is completed in a single step, which helps when processing large-scale data; and it can identify noise points. However, DPC also has some defects, mainly: a better method for obtaining the optimal parameter value is still needed; cluster centers must be selected manually, which is highly subjective; and the assignment mechanism for data points is inadequate for some manifold data sets. These defects limit the adoption and application of the DPC algorithm. This thesis proposes improved algorithms that address DPC's inability to handle manifold data sets. The main work and research results are as follows:

(1) A density peak clustering algorithm optimized by K-nearest-neighbor similarity is proposed to solve the problem that density peak clustering does not apply to manifold clustering (for example, the Circleblock and Lineblobs data sets). The problem arises because, when assigning points, DPC considers only the distance between a sample point and its pointing point (the closest point with a higher density). In manifold clustering, points close to high-density clusters are assigned first, which causes errors to propagate and eventually produces wrong results. This is the main reason DPC fails on manifold clustering. After the density and the pointing point of each point are calculated, the similarity function proposed in this thesis is used to compute the similarity between points and to find the K nearest neighbors of each point. Based on the K-nearest-neighbor information, the algorithm judges whether the pointing point of each sample point is correct. For each wrong pointing point it searches for the correct one, which effectively reduces assignment errors. Theoretical analysis and experiments show that the new algorithm achieves higher accuracy.

(2) A manifold density peak clustering algorithm optimized by a fast density feature map is proposed. When the Euclidean distance between a sample point and a wrong pointing point is small, the sample point points to the wrong point, the error propagates, and a wrong clustering result is obtained. Replacing the Euclidean distance with a manifold distance optimized by the fast feature map better reflects the similarities between points of different classes. The algorithm first constructs an undirected feature map by finding feature points. Then, for any two points, the manifold distance between them is calculated through the undirected feature map. Finally, the assignment is completed according to the manifold distance. Because the manifold distance enlarges the distance between points of different classes, the algorithm can obtain correct results. The two improved algorithms are then compared and analyzed to identify their advantages and disadvantages.

(3) The two improved algorithms are applied to Weibo text clustering. First, the Weibo texts are preprocessed to remove meaningless characters. The texts are then segmented into words and the stop words are removed. For the processed texts, feature words are selected using the information gain method. TF-IDF values are then computed on the text content and an N*M matrix (N documents and M feature words) is constructed. The two improved clustering algorithms and the original DPC algorithm are used to cluster the Weibo texts. Finally, the clustering results are compared to demonstrate the effectiveness and superiority of the improved algorithms proposed in this thesis.
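The matrix construction in part (3) can be illustrated with a minimal sketch. It assumes the texts are already segmented and stripped of stop words, and that the feature words have already been chosen (the information gain step is not shown). The function name `tfidf_matrix` and the plain `log(N / df)` weighting are illustrative assumptions, since the abstract does not state which TF-IDF variant the thesis uses.

```python
import math
from collections import Counter

def tfidf_matrix(docs, feature_words):
    """Build an N x M TF-IDF matrix as a list of rows.

    docs: list of token lists (already segmented, stop words removed).
    feature_words: the M selected feature words (selection step not shown).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each feature word.
    df = {w: sum(1 for d in docs if w in d) for w in feature_words}
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        row = []
        for w in feature_words:
            tf = counts[w] / total
            # Assumption: plain idf = log(N / df); a word found in no document gets 0.
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix
```

Each row of the result is one document and each column one feature word, so the rows can be passed to a clustering algorithm as sample points.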
Keywords/Search Tags: density peak clustering, cluster center, K nearest neighbor, similarity, manifold distance, text clustering
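For reference, the baseline DPC procedure that the abstract builds on (local density, pointing point, one-pass assignment) can be sketched as follows. This is a rough illustration of the original Rodriguez et al. scheme, not of the improved algorithms in the thesis. The function name `density_peaks`, the cutoff kernel, and the automatic choice of centers by the largest `rho * delta` (in place of manual selection) are assumptions made for the example.

```python
import numpy as np

def density_peaks(X, dc, n_centers):
    """Baseline DPC sketch: returns (labels, rho, delta, parent)."""
    n = len(X)
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density with a cutoff kernel: neighbors closer than dc (self excluded).
    rho = (dist < dc).sum(axis=1) - 1
    # Process points from highest to lowest density; ties are broken by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.zeros(n)
    parent = np.full(n, -1)  # "pointing point" of each sample
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()  # convention for the global density peak
            continue
        higher = order[:rank]  # points ranked as denser than i
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j
    # Assumption: centers are the n_centers points with the largest rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the global peak has no parent, so it is always a center
    centers = np.argsort(-gamma)[:n_centers]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    # One pass in decreasing density: each point inherits its parent's label.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta, parent
```

Because every point simply inherits the label of its pointing point, one wrong pointing point mislabels every point that follows it in the chain, which is the error propagation on manifold data that the abstract describes.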