
Research And Application Of Semi-supervised Clustering Algorithms

Posted on: 2011-05-28 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: R C Guan | Full Text: PDF
GTID: 1118360305453674 | Subject: Computer Science and Technology
Abstract/Summary:
Machine learning is an important area of Artificial Intelligence. Its purpose is to give computers the ability to learn and to make good use of the acquired knowledge in applications. In machine learning, supervised learning provides the computer with good exemplars in which plenty of information is embedded, and it also supplies the learning rules and correct directions. Unsupervised learning, on the other hand, has no exemplars or prior knowledge and relies solely on the algorithm itself to mine information; because of this aimlessness, it is often used only as a pre-processing tool in data mining. Semi-supervised learning, however, can analyze and mine data effectively with only a few exemplars or a little prior information, while avoiding such aimlessness and uncertainty. Recently, semi-supervised learning has become a new and interesting direction in machine learning.

Nonetheless, most of the successful semi-supervised machine learning algorithms address semi-supervised classification and semi-supervised regression; there has been little research on semi-supervised clustering. Accordingly, our research focuses on semi-supervised clustering analysis and proposes several new algorithms. For comparative analysis, the new semi-supervised clustering algorithms are applied to text mining. Meanwhile, for the central problem of how labeled and unlabeled sample sizes affect semi-supervised clustering, we carry out detailed experimental analysis and draw conclusions. The main contributions are as follows:

1) A new non-Euclidean similarity measurement containing structural information is proposed, on which the Tri-Set Affinity Propagation algorithm is built. In text mining, many frameworks employ the vector space model and similarity metrics in Euclidean space. Clustering texts this way is simple and easy to perform. However, as the data scale grows, the vector space becomes high-dimensional and sparse.
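For reference, the vector-space baseline discussed here — the cosine coefficient over term-frequency vectors — can be sketched as follows. This is a minimal illustration; the exact term weighting used in the thesis is not specified.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine coefficient between two documents in the vector space model.

    Documents are represented as sparse term-frequency vectors (Counter),
    so only non-zero dimensions are stored.
    """
    tf_a, tf_b = Counter(doc_a.split()), Counter(doc_b.split())
    # Dot product runs over the (usually small) set of shared terms only.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm = math.sqrt(sum(v * v for v in tf_a.values())) * \
           math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("data mining text clustering", "text clustering methods"))
```

Because real corpora have vocabularies of tens of thousands of terms while each document touches only a few hundred, these vectors are exactly the high-dimensional, sparse representation the text criticizes.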
The computational complexity then grows exponentially. To overcome this difficulty, a non-Euclidean similarity measurement is proposed based on the definitions of the Similar Feature Set (SFS), the Rejective Feature Set (RFS) and the Arbitral Feature Set (AFS). The new similarity measurement not only breaks the Euclidean-space constraint but also captures the structural information of documents. On this basis, a novel clustering algorithm named Weight Affinity Propagation (WAP) is developed by combining the new similarity measurement with Affinity Propagation (AP). The benchmark dataset Reuters-21578 is used to test the proposed algorithm. Experimental results show that the proposed method is superior to the classical k-means, the traditional SOFM and Affinity Propagation with a classic similarity measurement.

2) A new semi-supervised clustering algorithm called Seeds Affinity Propagation (SAP) is proposed. It makes two main contributions: (1) use of the Tri-Set similarity metric; (2) a novel seed-construction method that improves the semi-supervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it with k-means, the original Affinity Propagation with the cosine coefficient, Tri-Set Affinity Propagation, and semi-supervised Affinity Propagation with the cosine coefficient. Furthermore, we analyzed the individual effect of the two proposed contributions. Experimental results show that the proposed similarity metric is more effective in text clustering (F-measures ca. 21% higher than in the AP algorithm) and that the proposed semi-supervised strategy achieves both better clustering results and faster convergence (using only 76% of the iterations of the original AP). The complete SAP algorithm provides enhanced robustness compared with all other methods, obtains a higher F-measure (ca. 40% improvement over k-means and AP) and lower entropy (ca.
28% decrease over k-means and AP), and significantly improves clustering execution time (about twenty times faster than k-means).

3) A new semi-supervised clustering algorithm with incremental learning, called Incremental Affinity Propagation (IAP), is proposed. Since the original Affinity Propagation cannot directly cope with partially known data, a semi-supervised scheme called incremental affinity propagation clustering is proposed following Seeds Affinity Propagation. In this scheme, the prior information is represented by adjusting the similarity matrix, and incremental learning is applied to amplify the prior knowledge. To examine the effectiveness of the method, we apply it to the text clustering problem and describe the specific method accordingly. The method is applied to the benchmark data set Reuters-21578. Numerical results show that the proposed method performs very well on this data set and has clear advantages over two other commonly used clustering methods. Across experiments on four data scales, the average F-measure of IAP is 36.2% and 31.2% higher than Affinity Propagation with the cosine coefficient (AP(CC)) and k-means, respectively; the average entropy is 14.7% and 15.9% lower than AP(CC) and k-means, respectively.

4) The effect of labeled sample sizes on semi-supervised clustering is discussed and analyzed. To investigate this effect, based on k-means and AP, we implement five semi-supervised algorithms: Seeded k-means, Constrained k-means, Loose Seeds Affinity Propagation, Compact Seeds Affinity Propagation, and Seeds Affinity Propagation. All five algorithms are applied to three benchmark data sets in text mining, namely Reuters-21578, NSF Research Awards Abstracts 1990–2003 and 20 Newsgroups. Moreover, we plot the moving-average F-measure and entropy curves of the five algorithms' performance on the three data sets.
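The F-measure and entropy used as evaluation criteria throughout can be sketched with their standard external clustering definitions; this is a minimal sketch, and the exact variants computed in the thesis may differ.

```python
import math
from collections import Counter

def cluster_entropy(labels_true, labels_pred):
    """Average entropy of the true labels inside each predicted cluster,
    weighted by cluster size (lower is better)."""
    n = len(labels_true)
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    total = 0.0
    for members in clusters.values():
        counts = Counter(members)
        h = -sum((c / len(members)) * math.log2(c / len(members))
                 for c in counts.values())
        total += (len(members) / n) * h
    return total

def clustering_f_measure(labels_true, labels_pred):
    """Overall F-measure: for each true class take the best F1 over all
    clusters, then weight by class size (higher is better)."""
    n = len(labels_true)
    classes = Counter(labels_true)
    clusters = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    score = 0.0
    for cls, n_cls in classes.items():
        best = 0.0
        for clu, n_clu in clusters.items():
            nij = joint[(cls, clu)]
            if nij:
                prec, rec = nij / n_clu, nij / n_cls
                best = max(best, 2 * prec * rec / (prec + rec))
        score += (n_cls / n) * best
    return score
```

A perfect clustering yields an F-measure of 1.0 and an entropy of 0.0, which is why the results above report F-measure increases and entropy decreases together.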
Furthermore, we propose a satisfactory-rate measurement to search for the local optimal values, which we call check points. Experimental results show that increasing the labeled sample size helps semi-supervised clustering algorithms achieve better results; however, once the check points are reached, the improvements slow down or stagnate. Moreover, the locations of the check points often vary with the semi-supervised clustering algorithm and the data set.

5) The effect of unlabeled sample sizes on semi-supervised clustering is discussed and analyzed. To observe this effect, based on k-means and Affinity Propagation, we propose four novel semi-supervised algorithms: Incremental Seeded k-means, Incremental Constrained k-means, Incremental Affinity Propagation, and Incremental Seeds Affinity Propagation. These new algorithms are applied to three benchmark data sets, and the F-measure and entropy curves of unlabeled-sample learning results from 1 to 400 are plotted. Then, as in the analysis of the labeled-sample tests, the satisfactory rate is applied to find the check points. The experimental results show that, in most cases, incremental learning with a smaller unlabeled sample size can help the semi-supervised clustering algorithms achieve better results; but when the unlabeled sample size grows beyond the check points, the improvement may diminish or even be counteracted. In addition, the check-point locations vary with the semi-supervised clustering algorithm.

In summary, based on an analysis of classical clustering algorithms, we first proposed several novel clustering algorithms by introducing semi-supervised learning strategies; these new algorithms obtain good clustering results in the experiments. We then extended our study to the effect of labeled and unlabeled sample sizes on semi-supervised clustering.
Based on the analysis of the experimental results, it can be concluded that increasing either the labeled or the unlabeled sample size can improve semi-supervised clustering performance; however, when the labeled or unlabeled data size grows beyond the check points, the samples' effect on semi-supervised clustering diminishes or even disappears.
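The similarity-matrix adjustment at the heart of the semi-supervised scheme in contribution 3 can be sketched as follows. The rule used here — assigning a fixed boost to same-labeled seed pairs and a fixed penalty to different-labeled pairs — is an assumption for illustration, not the thesis's exact formula.

```python
def adjust_similarity(S, seed_labels, boost, penalty):
    """Encode prior label knowledge in a similarity matrix before
    clustering (illustrative rule, not the thesis's exact adjustment).

    S           : dense similarity matrix as a list of lists
    seed_labels : dict mapping sample index -> known class label
    boost/penalty : similarity assigned to same-/different-labeled pairs
    """
    A = [row[:] for row in S]   # work on a copy, leave S untouched
    seeds = list(seed_labels)
    for i in seeds:
        for j in seeds:
            if i == j:
                continue
            # Same-labeled seeds are pulled together; different-labeled
            # seeds are pushed apart before Affinity Propagation runs.
            A[i][j] = boost if seed_labels[i] == seed_labels[j] else penalty
    return A
```

The adjusted matrix can then be handed to any Affinity Propagation implementation that accepts a precomputed similarity matrix; the incremental scheme would repeat this step as newly clustered samples are promoted to seeds.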
Keywords/Search Tags: Machine Learning, Semi-supervised Clustering, Affinity Propagation, k-means