Research On K-means Text Clustering Algorithm Based On Improved Density Peaks

Posted on: 2019-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: X T Qi
GTID: 2428330596965809
Subject: Control Science and Engineering
Abstract/Summary:
Most information disseminated in daily life, such as books, magazines, and web pages, takes the form of text. With the spread of the Internet, information travels ever faster and the volume of text data has grown explosively. This text data contains a great deal of useful information, so text mining, which takes text data as its research object, is receiving increasing attention. Text clustering, an important branch of text mining, has attracted similar interest. This thesis reviews the research background, significance, current status, and open problems of text clustering. It introduces in detail the theories and techniques involved, including text preprocessing, construction of text representation models, text similarity measures, the main families of clustering algorithms, and cluster evaluation criteria. Building on this study of clustering algorithms, it improves the text clustering algorithm and thereby the quality of the clustering results. The main contributions are as follows:

(1) A density peaks clustering algorithm based on K-nearest neighbors is proposed. The clustering by fast search and find of density peaks algorithm (DPC) is subjective in how it calculates the local density of a sample. To address this, the local density is redefined using neighbor information, yielding a clustering by fast search and find of density peaks algorithm based on K-nearest neighbors (KDPC). The algorithm remedies the defect in how DPC defines local density. Experimental results on synthetic and real data sets show that KDPC can find the cluster centers of a data set and determine the number of clusters, and that its accuracy is higher than that of DPC.

(2) A K-means algorithm based on improved density peaks is proposed. To address the initialization defects of K-means, a K-means algorithm based on improved density peaks (KDP-means) is built on the proposed KDPC algorithm. KDP-means uses KDPC to determine the cluster centers and the number of clusters of the data set, which removes the requirement that K-means be given the number of clusters and the initial cluster centers before clustering. Experimental results on UCI data sets show that the algorithm reduces the number of iterations and the iteration time of K-means to some degree, and improves its stability and accuracy.

(3) A Chinese text clustering system based on KDP-means is designed and implemented. The system first vectorizes the text data and extracts its main features through word segmentation, stop-word removal, and construction of a vector space model. It then clusters the vectors with KDP-means and evaluates the clustering results. The system is tested on the Chinese text classification corpus from Sogou Lab, and the results are analyzed against the relevant evaluation criteria. The experiments show that KDP-means achieves higher accuracy than K-means and BIRCH, two representative clustering algorithms. In addition, KDP-means does not need the initial cluster centers or the number of clusters to be specified in advance, which makes it more practical to apply.
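
The ideas in points (1) and (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the KDPC density formula, so the sketch assumes one common K-nearest-neighbor form, rho_i = exp(-mean of squared distances to the k nearest neighbors). It also takes the number of centers as an argument instead of determining it automatically. The function names `knn_density_peaks`, `pick_centers`, and `kmeans_from_seeds` are hypothetical.

```python
import numpy as np

def knn_density_peaks(X, k=5):
    """Return (rho, delta) per sample, using an ASSUMED kNN-based local density."""
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Distances to the k nearest neighbours (column 0 is the sample itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    # Assumed density: large when the k nearest neighbours are close.
    rho = np.exp(-(knn ** 2).mean(axis=1))
    # delta: distance to the nearest sample of higher density (as in DPC).
    order = np.argsort(-rho)
    delta = np.empty(len(X))
    delta[order[0]] = dist[order[0]].max()  # densest sample gets the max distance
    for rank in range(1, len(X)):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()
    return rho, delta

def pick_centers(rho, delta, n_centers):
    """Pick the samples with the largest rho * delta as cluster centers."""
    return np.argsort(-(rho * delta))[:n_centers]

def kmeans_from_seeds(X, seeds, max_iter=100):
    """Plain Lloyd iterations, started from the given sample indices."""
    centers = X[seeds].astype(float)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Usage on two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])
rho, delta = knn_density_peaks(X, k=5)
seeds = pick_centers(rho, delta, n_centers=2)
labels, centers = kmeans_from_seeds(X, seeds)
print(np.bincount(labels))
```

Because the seeds are chosen deterministically from the data, repeated runs start from the same centers, which is the kind of stability gain over random initialization that the abstract reports for KDP-means.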
Keywords/Search Tags: Text clustering, K-means algorithm, Density peaks, KDP-means algorithm