
The Research And Implementation Of A Novel Text Clustering Algorithm Based On Density Peak

Posted on: 2017-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: X Lan
GTID: 2348330536967419
Subject: Computer Science and Technology
Abstract/Summary:
Birds of a feather flock together, and like-minded people fall into the same group. With the growing size of data sets on the Internet, clustering is now widely applied in digital image processing, species classification, social community detection, information security monitoring, enterprise intelligent decision-making, and text data mining. With the arrival of "Internet+" and the era of Big Data, traditional clustering algorithms, especially K-means, urgently need improvement in both convergence speed and the quality of clustering results. By introducing the concept of the density peak, this thesis focuses on fast clustering techniques that improve the selection of initial cluster centers, aims to enhance clustering quality by predicting the correct number of clusters through the detection of multiple density peaks, and then applies these techniques to text processing. The main work is as follows:

(1) A fast clustering algorithm that takes density peaks as initial centers (CIPD) is proposed. First, we analyze how random selection of the initial cluster centers may prevent the algorithm from reaching the global optimum, produce unstable results, and slow convergence. Then, based on the assumption that cluster centers have high density and lie far from one another, we put forward an index R that indicates how likely a data point is to become a cluster center, and design an initial cluster center selection method (PD). Combining PD with K-means, we develop a fast clustering method called CIPD. Test results on four UCI data sets show that CIPD achieves higher accuracy and faster convergence than other clustering methods.

(2) An automatic cluster number selection method based on finding multiple density peaks is proposed. Our research finds that the number of clusters is closely related to the number of density peaks. Based on this, the thesis presents an automatic cluster number selection method called CNSFDP, which works by detecting multiple density peaks. First, we design an index called CS that is closely related to the density peak. Next, we sort the CS values from high to low and plot them to describe the relationship between CS and the number of clusters. The resulting points form a curve with a significant turning point. Finally, the least squares method is used to find the turning point in this curve, and its value is returned as the number of clusters. Compared with other algorithms, CNSFDP places few demands on the data distribution, so it can cluster data sets with complex distributions, such as concave shapes, rings, or mixtures of complex distributions. Test results on six public UCI data sets show that CNSFDP finds the actual number of clusters more accurately than other methods.

(3) Based on the above work, an automatic clustering model for text data is designed. In section three, an automatic clustering algorithm called ACFDP is proposed. Based on ACFDP, we establish an automatic clustering model for text documents. The model comprises word segmentation, stop-word filtering, construction of the vector space model (VSM), and calculation of term frequency-inverse document frequency (TF-IDF). The resulting data are then clustered with ACFDP. Finally, we evaluate the clustering results and use the evaluation to tune the ACFDP algorithm.
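The idea in item (1), choosing initial centers that are both dense and far from one another, can be illustrated with a minimal Python sketch. The abstract does not define the index R, so the score used here (local density multiplied by the distance to the nearest denser point, as in the common density-peak formulation) is only an assumed stand-in. The function name `density_peak_centers` and the `cutoff` parameter are likewise hypothetical and do not come from the thesis.

```python
import math


def density_peak_centers(points, k, cutoff):
    """Return k candidate initial centers: the points with the largest
    rho * delta score (an assumed stand-in for the thesis's index R)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # rho: local density, counted as the number of other points
    # strictly closer than the cutoff distance.
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
        for i in range(n)
    ]

    # delta: distance to the nearest point of higher density.
    # Density ties are broken by index so that exactly one point has no
    # denser neighbor; that point gets its largest distance instead.
    delta = []
    for i in range(n):
        denser = [
            dist[i][j] for j in range(n) if (rho[j], -j) > (rho[i], -i)
        ]
        delta.append(min(denser) if denser else max(dist[i]))

    # Points that are dense AND far from any denser point score highest.
    ranked = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return [points[i] for i in ranked[:k]]


if __name__ == "__main__":
    data = [
        (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
        (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
    ]
    print(density_peak_centers(data, k=2, cutoff=1.0))
```

The returned points could then replace random seeds as the starting centers of an ordinary K-means run. Whether that reproduces the accuracy and convergence gains reported for CIPD depends on the actual definition of R, which the abstract does not give.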
Keywords/Search Tags:Cluster Analysis, Initial Center Selection, Cluster Number, Text Cluster