Research Of Text Clustering Based On NMF Algorithm

Posted on: 2015-09-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Ju
Full Text: PDF
GTID: 2298330422987405
Subject: Computer application technology
Abstract/Summary:
As one of the most important research topics in data mining and pattern recognition, clustering analysis has been widely used in areas such as data compression, document clustering, information retrieval, and image segmentation. In recent years, with the rapid growth of online information, document clustering has become increasingly important in fields such as information retrieval and memory management. Text data is high-dimensional and sparse, so many clustering algorithms cannot be applied to it directly; in addition, the massive scale of text collections demands high efficiency from clustering algorithms.

The vector space model is the traditional model for representing text documents as vectors. Because document vectors are high-dimensional and sparse, this thesis uses the non-negative matrix factorization (NMF) algorithm. NMF is a relatively new method for feature extraction. Because the factorization results are constrained to be non-negative, NMF-based features reflect more localized characteristics of the samples. The feature vectors extracted by NMF are therefore easier to interpret and to use for prediction.

This thesis introduces the basic ideas and basic algorithms of non-negative matrix factorization. First, because the NMF algorithm converges slowly and tends to converge to poor solutions, the thesis improves it by using the FCM algorithm for initialization. Second, the large size of text collections places even more stringent demands on clustering algorithms. The standard k-means algorithm must calculate the distance from each sample point to all cluster centers in every iteration, which wastes a great deal of computation time, especially when the amount of data is very large. To address this problem, the thesis proposes an improved k-means algorithm. Third, many clustering algorithms require the number of clusters to be specified before clustering, yet this number is usually not known in advance. For this case, the thesis proposes a new clustering algorithm, FGClus.

Experiments show that the improved k-means algorithm and the proposed FGClus algorithm are effective. Finally, the thesis integrates NMF and the improved NMF with the k-means algorithm, the improved k-means algorithm, and the proposed FGClus algorithm. The experimental results show that, for high-dimensional sparse text vectors, NMF integrated with a clustering algorithm outperforms direct use of the clustering algorithm, and that the improved NMF algorithm not only produces more accurate clustering results but also improves the efficiency of the algorithm.
Keywords/Search Tags: Text clustering, non-negative matrix factorization, clustering analysis, text cluster integration, k-means algorithm
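
The abstract does not give the details of the improved NMF, the improved k-means, or FGClus, so none of them is reproduced here. As a rough illustration of the basic idea only, the sketch below factors a small term-by-document matrix with plain NMF, using the standard multiplicative update rules of Lee and Seung and a random initialization (not the FCM initialization the thesis proposes). It then assigns each document to the factor with the largest coefficient, a simple stand-in for the k-means step. The toy matrix `V`, the helper names, and all parameter values are assumptions made for this example.

```python
import random


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, the reconstruction error."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, rank, iters=300, seed=0, eps=1e-9):
    """Approximate non-negative V (terms x docs) as W H with W, H >= 0.

    Uses multiplicative updates and a random positive start; this is the
    textbook baseline, not the FCM-initialized variant from the thesis.
    """
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_terms)]
    h = [[rng.random() + 0.1 for _ in range(n_docs)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_docs)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_terms)]
    return w, h


# Hypothetical toy data: 6 terms (rows) x 4 documents (columns).
# Documents 0-1 use only terms 0-2; documents 2-3 use only terms 3-5.
V = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
]

W, H = nmf(V, rank=2)

# Each column of H is a 2-dimensional representation of one document.
# Assign each document to its dominant factor (stand-in for k-means).
labels = [max(range(len(H)), key=lambda k: H[k][j]) for j in range(len(V[0]))]

print("reconstruction error:", round(frobenius_error(V, W, H), 4))
print("cluster labels:", labels)
```

In a pipeline like the one the abstract describes, the columns of `H` would be passed to a clustering algorithm such as k-means instead of the largest-coefficient rule. Because the result of multiplicative updates depends on the starting point, a different `seed` can change the outcome, which is the sensitivity that motivates a better initialization.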