The Effect Of Labeled Sample Scale On Semi-supervised Text Clustering

Posted on: 2015-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: R G Mo
GTID: 2268330428984284
Subject: Software engineering
Abstract/Summary:
In recent years, semi-supervised learning has attracted a great deal of attention. In this field, the number of labeled samples (the labeled sample scale) can significantly affect the clustering result. However, how many labeled samples are actually needed remains an open question. In this thesis, we investigate that question for text clustering. Based on two state-of-the-art clustering algorithms, namely k-means and Affinity Propagation (AP), we implement five semi-supervised clustering algorithms to test the effect of labeled sample scale: Seeded k-means (SK-means), Constrained k-means (CK-means), Loose Seeds Affinity Propagation (LSAP), Compact Seeds Affinity Propagation (CSAP), and Tri-Set Seeds Affinity Propagation (SAP). We apply the five algorithms to two benchmark data sets in text mining: Reuters-21578 and NSF Research Award Abstracts 1990-2003. Numerical results show that increasing the number of labeled samples does not always help the clustering algorithms reach a better solution. Once the labeled sample scale exceeds a threshold of 35% for the k-means-based algorithms or 25% for the AP-based algorithms, clustering performance plateaus or improves only slowly. These experimental results can guide applications of semi-supervised clustering: researchers can choose among the algorithms according to their purpose and the amount of labeled data available.
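To make the two k-means variants and the notion of labeled sample scale concrete, here is a minimal sketch, not the thesis's implementation. It assumes the usual definitions: seeded k-means initialises each centroid from the labeled seeds of its class and then lets every point move freely, while constrained k-means additionally keeps the seeds pinned to their given labels. The function names (`seeded_kmeans`, `sample_seeds`, `accuracy`), the squared Euclidean distance, and the synthetic 2-D points standing in for document vectors are all illustrative assumptions.

```python
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumption; text work often uses cosine)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(points, seeds, k, constrained=False, max_iter=100):
    """seeds maps point index -> class label in range(k).

    constrained=False: seeds only initialise the centroids (seeded k-means).
    constrained=True: seeds also keep their labels in every iteration.
    """
    centroids = []
    for c in range(k):
        members = [points[i] for i, lab in seeds.items() if lab == c]
        if not members:
            raise ValueError("every cluster needs at least one seed")
        centroids.append(mean(members))

    assign = [None] * len(points)
    for _ in range(max_iter):
        new_assign = []
        for i, p in enumerate(points):
            if constrained and i in seeds:
                new_assign.append(seeds[i])  # seed label is never changed
            else:
                new_assign.append(
                    min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                )
        if new_assign == assign:  # converged: no point changed cluster
            break
        assign = new_assign
        for c in range(k):
            members = [points[i] for i, a in enumerate(assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return assign, centroids


def sample_seeds(labels, fraction, rng):
    """Pick about `fraction` of each class as seeds, at least one per class."""
    seeds = {}
    for c in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        n = max(1, round(fraction * len(idx)))
        for i in rng.sample(idx, n):
            seeds[i] = c
    return seeds


def accuracy(assign, labels):
    """Seeds tie cluster ids to class labels, so a direct comparison works."""
    return sum(a == b for a, b in zip(assign, labels)) / len(labels)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two overlapping synthetic blobs; real experiments would use text vectors.
    points = [[rng.gauss(0, 2), rng.gauss(0, 2)] for _ in range(100)]
    points += [[rng.gauss(3, 2), rng.gauss(3, 2)] for _ in range(100)]
    labels = [0] * 100 + [1] * 100
    for fraction in (0.05, 0.15, 0.25, 0.35, 0.45):
        seeds = sample_seeds(labels, fraction, rng)
        for constrained in (False, True):
            assign, _ = seeded_kmeans(points, seeds, 2, constrained)
            print(fraction, constrained, accuracy(assign, labels))
```

Sweeping `fraction` in this way mirrors the experimental design described above in miniature. The toy data are not expected to reproduce the 35% and 25% thresholds, which were measured on Reuters-21578 and the NSF abstracts.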
Keywords/Search Tags: Semi-supervised Clustering, Labeled Sample, Text Clustering