
Density Peaks Clustering Algorithm For High Dimensional And Dynamic Data

Posted on: 2020-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: R T Di
GTID: 2428330575959484
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, massive amounts of data are being generated, and data mining emerged to extract useful information from them. Clustering divides data into clusters by measuring the similarity or dissimilarity between samples, with little or no prior knowledge. As a basic and critical data mining technique, clustering has been widely used in security, text mining, image processing, bioinformatics, business intelligence and other fields, and it has attracted sustained research attention. Density peaks clustering (DPC) is a clustering method published in Science in June 2014. It is simple and efficient, requires no iteration, finds cluster centers quickly, and assigns sample points efficiently. It offers a new way to process massive heterogeneous data and has therefore been widely recognized and applied. Although DPC performs well on many classical datasets and in real-world scenarios, it still needs improvement on high-dimensional and dynamic datasets. For dynamic data, how to cluster newly arriving data effectively on the basis of the clustering of the initial data remains an open problem for DPC. For high-dimensional data, the curse of dimensionality, together with data sparsity, prevents DPC from achieving good clustering results. To address these problems, this thesis proposes a DPC algorithm for high-dimensional and dynamic data. The main work is as follows:

1. A density peaks clustering method for dynamic data is proposed. Because the data is generated dynamically, it cannot all be obtained at one time, so DPC on dynamic data consists of two stages: clustering the initial data and clustering the new data. Because the clustering result of the initial data directly affects the clustering result of the new data, this thesis first proposes a density peaks clustering algorithm based on shared nearest neighbor optimization (IDPC), which overcomes shortcomings of the traditional DPC algorithm and clusters the initial data more accurately. Then, for the newly added data, an incremental dynamic density peaks clustering algorithm based on shared nearest neighbor optimization (SNN-DPC) is proposed.

2. A density peaks clustering algorithm for high-dimensional and dynamic data is proposed. Because the density peaks clustering method for dynamic data performs poorly on high-dimensional data, this thesis uses a stacked autoencoder to reduce the dimensionality and then clusters the reduced data with the density peaks clustering method for dynamic data. The stacked autoencoder is compared with the PCA and LLE dimensionality reduction methods to verify its effectiveness. With AMI, ARI and Acc as evaluation metrics, the proposed algorithm is compared with DPC, KNN-DPC and DBSCAN, which confirms its effectiveness on high-dimensional and dynamic data.

3. A false comment recognition method based on high-dimensional dynamic density peaks clustering is proposed. Online false comments are both high-dimensional and dynamic. To capture their high-dimensional nature, this thesis first constructs a multi-dimensional feature model of false comments. Second, to describe more effectively how paid posters (the "water army") behave over time, KNN-based features of false comments in the time dimension are proposed. Based on this analysis, 66 features of false comments are extracted. Finally, the comments are divided in chronological order into two datasets, used as the initial dataset and the new dataset respectively, and clustered with the density peaks clustering method for high-dimensional and dynamic data proposed in the previous part. The false comment cluster is then identified from the spatial distribution characteristics of false and real comments.
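
The baseline DPC procedure that the abstract builds on can be sketched roughly as follows. This is a minimal illustration of plain DPC only, not the IDPC or SNN-DPC algorithms proposed in the thesis, whose details the abstract does not give. The function name `density_peaks`, the cutoff distance `d_c`, and the fixed `n_clusters` are assumptions made for the example; choosing the centers as the points with the largest product of density and distance is a common simplification of inspecting a decision graph by hand.

```python
import math

def density_peaks(points, d_c, n_clusters):
    """Baseline DPC sketch (hypothetical helper): returns (labels, centers)."""
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta_i: distance to the nearest point earlier in the density order.
    # The densest point gets its largest distance to any other point.
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Centers: the n_clusters points with the largest gamma = rho * delta.
    # Sorting `order` stably keeps the densest point first among ties.
    centers = sorted(order, key=lambda i: -(rho[i] * delta[i]))[:n_clusters]

    # One pass, no iteration: each remaining point inherits the label of
    # its nearest denser neighbor, which has already been labeled.
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers
```

Because labels are assigned in a single pass in decreasing order of density, the sketch reflects the "no iteration" property mentioned in the abstract. It also shows why dynamic data is awkward for plain DPC: all densities and distances are computed from the full dataset at once.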
Keywords/Search Tags: High-dimensional and dynamic data, stacked autoencoder, density peaks clustering, shared nearest neighbors, false comment recognition