During the past few years, semi-supervised learning has attracted a great deal of attention. In this research field, the number of labeled samples can significantly affect the clustering result. However, how many labeled samples are ideal remains an open problem. In this paper, we investigate this question in the context of text clustering. Based on two state-of-the-art clustering algorithms, namely k-means and Affinity Propagation (AP), we implement five semi-supervised clustering algorithms (Seeded k-means (SK-means), Constrained k-means (CK-means), Loose Seeds Affinity Propagation (LSAP), Compact Seeds Affinity Propagation (CSAP), and Tri-Set Seeds Affinity Propagation (SAP)) to study the effect of labeled sample scale. We apply the five algorithms to two benchmark data sets in text mining: Reuters-21578 and NSF Research Award Abstracts 1990-2003. Numerical results show that increasing the number of labeled samples does not always help the clustering algorithms reach a better solution. When the labeled sample scale exceeds the check point of 35% for the k-means based algorithms, or 25% for the AP based algorithms, the learning ability of these algorithms stagnates or grows only slowly. These experimental results can guide semi-supervised clustering applications: researchers can select different algorithms according to their purposes.
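To make the seeded variants concrete, the following is a minimal sketch of the Seeded k-means idea (labeled samples serve as seeds that initialize the cluster centroids); the function name, argument layout, and the `-1` convention for unlabeled points are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def seeded_kmeans(X, seed_labels, n_iters=20):
    """Sketch of Seeded k-means (hypothetical interface, for illustration).

    X           : (n, d) array of data points
    seed_labels : (n,) array; cluster id for labeled (seed) points, -1 if unlabeled
    Returns the final assignment and centroids.
    """
    ks = np.unique(seed_labels[seed_labels >= 0])
    # Initialize each centroid as the mean of its seed points.
    centroids = np.stack([X[seed_labels == k].mean(axis=0) for k in ks])
    for _ in range(n_iters):
        # Assign every point (seeds included) to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids from the current assignment. In the Seeded
        # variant seeds may drift to other clusters; Constrained k-means
        # would instead keep the seed assignments fixed in this step.
        centroids = np.stack([X[assign == k].mean(axis=0)
                              for k in range(len(ks))])
    return assign, centroids
```

The only difference between the SK-means and CK-means variants in this sketch is whether the seed points are allowed to be reassigned during the update loop.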