Font Size: a A A

Semi-supervised Non-negative Matrix Factorization And Its Application In Document Clustering

Posted on:2013-11-10Degree:MasterType:Thesis
Country:ChinaCandidate:L LanFull Text:PDF
GTID:2298330422973842Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the development of computer network, the Internet information (includingnews, forums, online shopping and advertising) increasing sharply, which greatlyfacilitate the social life and human work. Most of the Internet information is displayedvia text pages, and how to effectively analyze and process text information has becomean pressing problem of Internet-oriented data mining. Text page is characterized by highdimension, diverse format and large amount of data, text clustering techniques gatheredsemantically similar text into text cluster, and according to the clustering results, we cantake the idea of “divide and rule” to analyze each text cluster and reduce the complexityof the text data processing. Therefore, text clustering play an important rule in the textdata mining, undoubtedly it has aroused widespread concern and made considerabledevelopment.Traditional clustering methods belong to the category of unsupervised learning,which uses the statistical properties of data to gather the “similar” samples together. Themost typical method is the K-means algorithm and vector quantization algorithm.Non-negative matrix factorization (NMF) is an novel unsupervised clustering algorithmproposed in recent years, which relax the constraints of the K-means algorithm andvector quantization algorithm, and replaced by a non-negative constraints. The clustercenter and clustering indicates vector derived from non-negative matrix factorization areboth non-negative, it assumes that each sample is the weighted summation of all clustercenters, and the corresponding weight measured the relationship between the sampleand cluster centers. Therefore, the weight can easily determine the category of thesample. Although the statistical characteristics of the data can represent data effectively,the ability is limited. To solve this problem, this paper inspired by the semi-supervisedlearning and proposed semi-supervised non-negative matrix factorization algorithms.Semi-supervised non-negative matrix factorization using a small amount of labeled datato improve the remaining unlabeled data clustering performance. Semi-supervisedlearning is a hot research topic in the field of machine learning, mainly to solve theproblem of insufficient labeled data in classification algorithms. For example, in thetraining process, the support vector machine learn an ultra classification surface basedthe training data, then classify the test data with the surface. However, when lack oftraining data, the derived classification surface cannot satisfy test data classificationeffectively, often referred to as over-fitting problems. If taking into account theunlabeled data in the training process, the over-fitting problem can be preventeffectively.According to the basic idea of semi-supervised learning, this paper proposed twotypes of semi-supervised non-negative matrix factorization algorithms, graphics-based semi-supervised non-negative matrix factorization (GSS-NMF) and semi-supervisednon-negative patch alignment framework (SS-NPAF). GSS-NMF embeded the labelinformation of labeled samples based on graph, and improved the semi-supervisedclustering performance. SS-NPAF enhanced the discrimiant of clustering byconstructing semi-supervised local patch. Both the methods utilize the label informationof labeled samples as well as geometric structure of unlabeled samples in NMF. Themultiplication rule is the most popular non-negative matrix factorization optimizationalgorithm, which start from the non-negative initial value, the matrix factor multipliedby the non-negative matrix to update and ensure that this update reduce the objectivefunction value, with the advantages of simple and low computation. However, themultiplication rule drawback is the slow convergence, and the optimization oftenrequire thousands of iterations to converge. In this paper, we developed C-J Lin’sprojected gradient method to optimize the GSS-NMF and SS-NPAF, which achievedbetter convergence rate than the multiplication rule, in the case of maintaining the samecomputational complexity. And the proposed gradient projection method, like C-J Lin’salgorithm, theoretically ensured convergence to the GSS-NNF and SS-NPAFoptimization problems stagnation.In order to verify the effect of GSS-NMF and SS-NPAF clustering algorithm. wetested on the benchmark database Reuters-21578and TDT-2, and compared withtraditional non-negative matrix factorization algorithms and other semi-supervisednon-negative matrix factorization algorithm. Experimental results show that:1)GSS-NMF reinforces the discriminative power by transferring the discriminativeinformation from labeled samples to unlabeled samples, and greatly improve theclustering performance;2) relative to NPAF, SS-NPAF constructed semi-supervisedlocal patch following the framework of non-negative matrix factorization, and this kindof local patch showed good performance in clustering;3) with respect to themultiplication rule, the development of a projection gradient method can be moreefficiently optimize the GSS-NMF and SS-NPAF.
Keywords/Search Tags:data mining, document clustering, semi-supervised learning, non-negative matrix factorization
PDF Full Text Request
Related items