
Research On Density Peak-based Clustering Algorithm And Its Parallel Implementation

Posted on: 2021-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Liu
GTID: 2518306032959569
Subject: Computer system architecture
Abstract/Summary:
In recent years, with the explosive growth of data, artificial intelligence and machine learning have developed rapidly. As a tool for knowledge discovery, data mining has attracted more and more attention. In this field, cluster analysis is a commonly used data processing technique that partitions data without class labels. At present, cluster analysis is widely used in electronic commerce, image processing, web mining, biology, security, and other fields. The density peak clustering algorithm, based on a fast search for density peaks, is a relatively new clustering algorithm. It clusters samples using their local density and the relative distance between them; it is simple in principle, efficient to implement, and effective at selecting initial cluster centers. In this dissertation, the idea of the density peak algorithm is combined with the fuzzy C-means (FCM) algorithm and the K-means clustering algorithm respectively. Experiments show that both improved algorithms achieve better clustering efficiency. The main contents of this dissertation are as follows:

(1) The traditional FCM algorithm is simple in principle and overcomes the shortcomings of general clustering algorithms. However, it is sensitive to the randomly determined initial cluster centers, so the resulting partition is easily affected by them. To solve this problem, the density peak algorithm is used to optimize the selection of the initial cluster centers, and an improved fuzzy C-means algorithm based on density peaks (DP-FCM) is proposed. First, the local density and relative distance of each sample are used to define a density distance index. The data samples are traversed, the average density distance is calculated, and the points whose density distance index is greater than the average are selected as cluster centers. Second, the FCM algorithm takes the cluster centers determined by the density peaks as its starting point. The membership matrix is initialized from the distance between each data sample and the cluster centers, and the value of the objective function is calculated. The membership matrix and the fuzzy cluster centers are then updated repeatedly until the stopping condition is reached, which gives the final clustering result. The DP-FCM algorithm overcomes the sensitivity to the initial cluster centers and improves clustering accuracy. Compared with four other clustering algorithms, the overall clustering efficiency of DP-FCM is significantly improved.

(2) Text data in data mining is sparse and high-dimensional. To address these characteristics, this dissertation combines the density peak algorithm with the K-means algorithm and proposes a K-means algorithm based on density peaks and weighted distance (DPK-means). First, unstructured or semi-structured Chinese text is transformed into structured data that can be understood and processed, using Chinese word segmentation and the TF-IDF algorithm. Second, a decision graph is drawn from the local density and relative distance parameters of the density peak algorithm to determine the cluster centers and the number of clusters. Finally, the mean vector of each cluster and the weighted Euclidean distance of each data sample are calculated, and each sample is assigned to the nearest cluster. The mean vectors are then recalculated, and the iteration repeats until the cluster centers no longer change or the maximum number of iterations is reached, which completes the K-means procedure. The experiments show that the algorithm is inefficient and slow on large-scale text data, so it is parallelized on the Spark platform. Comparing the clustering results on each data set shows that the DPK-means algorithm significantly improves the clustering of Chinese text data sets. Comparing the running time and speedup ratio for different numbers of nodes shows that the parallel DPK-means algorithm can reduce the energy consumption and improve efficiency.
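The center-selection step described in (1) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the code below makes several assumptions: local density is a count of neighbors within a cutoff distance d_c, the density distance index is the product of local density and relative distance, and ties in density are broken by sample order. The function names (density_peak_indices, select_centers) and the toy data are hypothetical and not taken from the dissertation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def density_peak_indices(points, d_c):
    """Return (rho, delta, gamma) for every sample.

    rho   : local density, assumed here to be the number of other
            samples closer than the cutoff d_c
    delta : relative distance, i.e. distance to the nearest sample
            of higher density
    gamma : assumed density distance index, rho * delta
    """
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Samples of higher density; equal densities are ordered by
        # index (an assumption) so that exactly one sample is densest.
        higher = [dist[i][j] for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        # The densest sample has no such neighbor, so it receives the
        # largest distance to any other sample.
        delta.append(min(higher) if higher else max(dist[i]))
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, delta, gamma


def select_centers(gamma):
    """Indices whose density distance index exceeds the average."""
    mean_gamma = sum(gamma) / len(gamma)
    return [i for i, g in enumerate(gamma) if g > mean_gamma]


# Toy data: two well-separated groups, each with one central sample.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
          (10, 10), (11, 10), (9, 10), (10, 11), (10, 9)]
rho, delta, gamma = density_peak_indices(points, d_c=1.2)
centers = select_centers(gamma)
print(centers)  # indices of the two central samples: [0, 5]
```

In a DP-FCM style pipeline the selected samples would then seed the FCM iteration in place of randomly chosen centers. The pairwise distance matrix makes this sketch quadratic in the number of samples, which is consistent with the abstract's motivation for parallelizing on large data sets.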
Keywords/Search Tags: Density Peak Clustering, FCM, K-means, Clustering, Text Clustering, Spark