Research On Theory And Technology Of Semi-Supervised Clustering Ensemble

Posted on: 2014-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: D H Chen
GTID: 2248330398474682
Subject: Computer application technology
Abstract/Summary:
Clustering analysis is an important technique in data mining and machine learning. It is widely used in many fields, especially in the processing and analysis of big data. Given a similarity measure, clustering divides the data objects into several clusters so as to maximize the similarity between objects in the same cluster and minimize the similarity between objects in different clusters. In practice, unsupervised clustering algorithms run without considering any prior knowledge, and a single clustering algorithm can rarely handle datasets whose structure or distribution is complex. Semi-supervised clustering ensemble makes up for this deficiency by applying semi-supervised learning and ensemble learning to clustering analysis, and it can effectively improve clustering performance. However, research on semi-supervised clustering ensemble is only just emerging, and there are few theoretical analyses. Theoretical study can provide a solid foundation for the development of semi-supervised clustering ensemble.

Semi-supervised clustering ensemble makes full use of prior knowledge to guide the clustering process, which can improve clustering performance; at the same time, it uses ensemble learning to combine the base clusterings and obtain better results. Inspired by research on semi-supervised learning and clustering ensembles, and drawing on probability and statistics, this thesis presents a mathematical analysis and discussion of semi-supervised clustering ensemble. Under some assumptions, it gives a mathematical proof and analysis of the convergence of semi-supervised clustering ensemble. The author proposes the concept of a robust radius to measure the degree of robustness and uses it to analyse the robustness of semi-supervised clustering ensemble.
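
The abstract does not state the assumptions behind the convergence proof. As a rough illustration only, the following is a standard majority-voting argument, not the thesis's own derivation. It assumes (hypothetically) that, after relabeling, each of the m base clusterings assigns a given object its correct label independently with the same probability p > 1/2; the symbols m, p, X_i and S_m are introduced here for illustration.

```latex
% Assumption (not from the thesis): X_1, ..., X_m are independent indicators,
% X_i = 1 if base clustering i labels the object correctly, P(X_i = 1) = p > 1/2.
\[
  S_m = \sum_{i=1}^{m} X_i , \qquad \mathbb{E}[S_m] = mp .
\]
% The majority vote is certainly correct when S_m > m/2, so by Hoeffding's inequality
\[
  P(\text{vote is wrong})
  \le P\!\left(S_m \le \tfrac{m}{2}\right)
  = P\!\left(S_m - mp \le -m\left(p - \tfrac12\right)\right)
  \le \exp\!\left(-2m\left(p - \tfrac12\right)^{2}\right)
  \xrightarrow[m \to \infty]{} 0 .
\]
```

Under these assumptions the error of the voted label decays exponentially in the number of base clusterings, which is one way to read the statement that the ensemble converges as more diverse base partitions are added.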
This thesis discusses a new relabeling approach based on the contingency matrix to unify the labels of the base clusterings (partitions), and then adds prior knowledge, in the form of pairwise constraints, to a semi-supervised clustering ensemble model based on majority voting. The experimental results show that prior knowledge can improve the performance of both the base clusterings and the semi-supervised clustering ensemble, that semi-supervised clustering ensemble exhibits convergence and robustness, and that the approach obtains better clustering results.

Semi-supervised clustering ensemble can effectively use prior knowledge to guide the clustering and ensemble process, improving clustering performance by aggregating multiple diverse partitions. This thesis proves the convergence of semi-supervised clustering ensemble using statistical techniques and presents a robustness measure for analysing robustness. A new semi-supervised clustering ensemble model based on majority voting is then proposed. The experimental results show that, as the number of diverse base partitions increases, semi-supervised clustering ensemble converges and is robust. With prior knowledge, the semi-supervised clustering ensemble method based on majority voting performs better than other clustering ensemble algorithms.
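
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: align the labels of each base partition to a reference partition through a contingency matrix, then take a majority vote per object. All function names are hypothetical. Two choices are assumptions made for illustration and are not taken from the thesis: the alignment uses an exhaustive search over label permutations (practical only for a small number of clusters k), and prior knowledge is limited to must-link pairs, which are honoured by pooling the votes of linked objects.

```python
from collections import Counter
from itertools import permutations


def contingency(reference, partition, k):
    """k x k table; entry [i][j] counts objects labeled i in reference and j in partition."""
    table = [[0] * k for _ in range(k)]
    for r, p in zip(reference, partition):
        table[r][p] += 1
    return table


def relabel(reference, partition, k):
    """Rename the labels of `partition` to maximise agreement with `reference`.

    Assumption: exhaustive search over permutations, suitable for small k only.
    """
    table = contingency(reference, partition, k)
    # perm[j] is the reference label assigned to partition label j.
    best = max(permutations(range(k)),
               key=lambda perm: sum(table[perm[j]][j] for j in range(k)))
    return [best[p] for p in partition]


def must_link_groups(n, must_links):
    """Union-find: map each object to the representative of its must-link group."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in must_links:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]


def constrained_vote(aligned, must_links=()):
    """Majority vote over aligned partitions; must-linked objects pool their votes."""
    n = len(aligned[0])
    group = must_link_groups(n, must_links)
    pooled = {}
    for i in range(n):
        pooled.setdefault(group[i], Counter()).update(p[i] for p in aligned)
    return [pooled[group[i]].most_common(1)[0][0] for i in range(n)]


def ensemble(partitions, k, must_links=()):
    """Align every base partition to the first one, then vote."""
    reference = partitions[0]
    aligned = [reference] + [relabel(reference, p, k) for p in partitions[1:]]
    return constrained_vote(aligned, must_links)


if __name__ == "__main__":
    base = [[0, 0, 0, 1, 1, 1],   # reference partition
            [1, 1, 1, 0, 0, 0],   # same grouping, labels swapped
            [0, 0, 1, 1, 1, 1]]   # one object placed differently
    print(ensemble(base, k=2))    # [0, 0, 0, 1, 1, 1]
```

In this sketch, a must-link pair can change the consensus: for the base partitions `[[0, 0, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]`, the unconstrained vote puts object 1 in cluster 1, while adding the pair `(0, 1)` pools the votes of objects 0 and 1 and keeps both in cluster 0. Cannot-link constraints are not handled here.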
Keywords: Clustering Analysis, Semi-supervised Clustering Ensemble, Convergence, Robustness, Majority Voting