
Research And Application Of Clustering Algorithm Based On Density Peak

Posted on: 2022-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: J C Yang
GTID: 2518306485494624
Subject: Computer Science and Technology
Abstract/Summary:
With the Internet now part of daily life, information spreads ever faster, and every industry generates large volumes of data. Extracting valuable information from this complex data has become an urgent problem. Clustering, an unsupervised machine learning method, is widely used in data analysis and data mining. Scholars in China and abroad have proposed a variety of clustering methods for different practical application scenarios, and research on clustering remains active. Clustering has been applied successfully in many fields, including customer segmentation, target recognition, natural language processing, image retrieval, biology, and security. This thesis studies the Density Peaks Clustering (DPC) algorithm, analyzes its advantages and disadvantages in detail, proposes a new clustering algorithm, and applies it to news text clustering. The main contents of this thesis are as follows:

(1) To address the problem that the choice of cutoff distance in the DPC algorithm depends on human judgment, a nearby density peak clustering algorithm based on information entropy optimization (the IKDPC algorithm) is proposed. First, an influence factor is introduced, and the optimal cutoff distance is determined from the plot of the information entropy function. Second, to reduce the high complexity of the local density calculation in the DPC algorithm, a formula for nearby local density based on the idea of the K-Nearest Neighbor algorithm is proposed. Finally, the number of clusters and the cluster centers are selected from the decision graph to complete the clustering. The experimental results show that the IKDPC algorithm determines the cluster centers and the number of clusters more accurately and clusters better than the DPC algorithm.

(2) Traditional clustering algorithms select their initial cluster centers at random, which makes the clustering results unstable and sensitive to outliers, and the cluster allocation step of the IKDPC algorithm may propagate assignment errors. To address both problems, a K-Means clustering algorithm based on the optimization of nearby density peaks (the IKDKM algorithm) is proposed by combining the IKDPC algorithm with the K-Means algorithm. The IKDKM algorithm first uses the decision graph generated by the IKDPC algorithm to determine the number of clusters and the cluster centers, and then calculates the average distance within each cluster during the iterative clustering process. The data objects in each cluster are divided into core points and outliers. The core points participate in the calculation of the new cluster centers, and the data objects classified as outliers are reallocated by voting. The experimental results show that the IKDKM algorithm has higher clustering accuracy.

(3) To test the practicability and effectiveness of the algorithm, the IKDKM algorithm proposed in this thesis is applied to the clustering of news text datasets. First, word segmentation and stop-word filtering are carried out on the news text dataset with the word segmentation tool 'jieba'. Second, a text vector calculation method based on weighted Word2vec is used to convert each news text into a text vector, and the text vectors are clustered with the clustering algorithm. Finally, cluster labels are added according to the clustering results of the IKDKM algorithm. The experimental results on the news text dataset show that the IKDKM algorithm has good practical applicability.
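The density-peak steps described in (1) can be illustrated with a minimal sketch. It is not the thesis's IKDPC algorithm. The abstract does not give the nearby local density formula, so the sketch assumes a common kNN-style density, exp(-mean squared distance to the k nearest neighbours). It also omits the entropy-based choice of cutoff distance, and it replaces visual inspection of the decision graph with the score rho * delta. The function name `knn_density_peaks` and its parameters are hypothetical.

```python
import math


def knn_density_peaks(points, k=5, n_clusters=2):
    """Sketch of density-peak clustering with an ASSUMED kNN-based density.

    points: sequence of equal-length coordinate tuples (at least two points).
    Returns (labels, centers), where centers holds indices into points.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Assumed local density (not the thesis's formula):
    # exp(-mean squared distance to the k nearest neighbours).
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))

    # Visit points from densest to sparsest.
    order = sorted(range(n), key=lambda i: -rho[i])

    # delta[i]: distance to the nearest point of higher density;
    # parent[i]: index of that point. The densest point gets its
    # largest distance to any other point, as in the original DPC.
    delta = [0.0] * n
    parent = [-1] * n
    delta[order[0]] = max(dist[order[0]])
    for pos in range(1, n):
        i = order[pos]
        j = min(order[:pos], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j

    # Stand-in for reading the decision graph: points with both high
    # density and high delta get the largest rho * delta.
    gamma = [rho[i] * delta[i] for i in range(n)]
    centers = sorted(range(n), key=lambda i: -gamma[i])[:n_clusters]

    # Each remaining point inherits the label of its nearest
    # higher-density neighbour, processed in decreasing density.
    labels = [-1] * n
    for label, i in enumerate(centers):
        labels[i] = label
    for i in order:
        if labels[i] == -1:
            if parent[i] >= 0:
                j = parent[i]
            else:  # densest point was not picked as a center (ties only)
                j = min(centers, key=lambda c: dist[i][c])
            labels[i] = labels[j]
    return labels, centers
```

Calling `knn_density_peaks(points, k=5, n_clusters=2)` on a list of 2-D tuples returns one label per point plus the indices of the chosen centers. Because every point inherits its label from a single higher-density neighbour, one wrong link carries over to all points that follow it, which is the error propagation that (2) aims to reduce.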
Keywords/Search Tags: clustering, density peak, cutoff distance, nearby local density, news text