Research On Semi-supervised Clustering Ensemble Based On Soft-voting

Posted on: 2015-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: H S Wang
GTID: 2268330428476089
Subject: Computer software and theory
Abstract/Summary:
Clustering analysis is one of the most widely used techniques in data mining. It first groups all data objects into clusters and then analyzes the result to uncover implicit information of practical value. Clustering partitions a large, unorganized set of data objects into several clusters according to their similarity, so that objects within the same cluster are as similar as possible and objects in different clusters are as dissimilar as possible. A clustering ensemble takes as its base clusterings the results of different clustering algorithms, or of the same algorithm run several times with different parameter settings, and applies an appropriate consensus function to integrate them into a new clustering result. Clustering is an unsupervised learning method. Semi-supervised clustering improves clustering performance by introducing a small amount of prior knowledge, known as semi-supervised information, into the clustering process. A semi-supervised clustering ensemble combines the advantages of both: the semi-supervised information guides the ensemble toward a better result.
Depending on how objects are assigned to clusters, clustering methods are generally divided into two kinds: hard clustering and soft clustering. The result of hard clustering is a set of cluster labels, in which each data object belongs to exactly one cluster. The result of soft clustering is a matrix of membership degrees, in which every data object may belong to any cluster with a different degree of membership. Previous studies have shown that soft clustering outperforms hard clustering in some respects.
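The difference between the two kinds of result can be made concrete with a small sketch. The membership values below are invented for illustration, and `harden` is a hypothetical helper name, not something taken from the thesis; it simply assigns each object to the cluster with its largest membership degree.

```python
# Illustrative only: the membership degrees below are made up.
def harden(memberships):
    """Turn a soft clustering (rows = objects, columns = clusters)
    into hard labels by taking the cluster with the largest membership.
    Ties go to the lowest cluster index."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


# Soft clustering result: one row of membership degrees per data object.
soft = [
    [0.90, 0.10],  # object 0: clearly in cluster 0
    [0.55, 0.45],  # object 1: almost ambiguous
    [0.20, 0.80],  # object 2: clearly in cluster 1
]

# Hard clustering result: one cluster label per data object.
hard = harden(soft)
print(hard)  # [0, 0, 1]
```

Objects 0 and 1 receive the same hard label even though object 1 is far less certain; that difference is visible only in the membership matrix.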
Traditional ensemble algorithms usually take hard clustering results as input. To apply such an algorithm to an ensemble of soft clusterings, the soft results must first be "hardened", and this step discards some valuable information. To solve this problem, this thesis proposes a new ensemble approach for soft clustering results, called the Soft-Voting Clustering Ensemble. The algorithm offers better flexibility and generalization, and experiments show that it obtains better clustering results.
To further improve the performance of the Soft-Voting algorithm, the thesis also uses semi-supervised information to guide the clustering ensemble process. The semi-supervised information is represented in two forms, pairwise constraints and cluster labels, and a corresponding semi-supervised Soft-Voting clustering ensemble algorithm is designed for each. Experimental results show that both forms of semi-supervised information improve the accuracy of the clustering results to a certain extent.
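The information lost by hardening can be illustrated with a minimal sketch. This is not the thesis's algorithm, which the abstract does not specify: it assumes the simplest possible soft vote, namely averaging the membership matrices of the base clusterings, and it assumes that cluster columns are already aligned across base clusterings. The function names and the data are hypothetical.

```python
from collections import Counter


def harden(memberships):
    """Assign each object to the cluster with its largest membership."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


def soft_vote(base_memberships):
    """Average the membership matrices of several base soft clusterings.
    Assumes equal shapes and cluster columns aligned across matrices."""
    n_base = len(base_memberships)
    n_objects = len(base_memberships[0])
    n_clusters = len(base_memberships[0][0])
    return [
        [sum(m[i][k] for m in base_memberships) / n_base for k in range(n_clusters)]
        for i in range(n_objects)
    ]


def hard_vote(base_memberships):
    """Harden every base clustering first, then take a majority vote per object."""
    label_lists = [harden(m) for m in base_memberships]
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*label_lists)]


# Three base soft clusterings of a single object (made-up values).
# Two of them lean very slightly toward cluster 0; one strongly favors cluster 1.
base = [
    [[0.51, 0.49]],
    [[0.51, 0.49]],
    [[0.05, 0.95]],
]

print(hard_vote(base))          # [0]
print(harden(soft_vote(base)))  # [1]
```

The hardened majority vote picks cluster 0 because two weak preferences outvote one strong one, while the averaged memberships (about 0.36 versus 0.64) favor cluster 1. A real ensemble would also need to resolve the label correspondence between base clusterings, which this sketch takes as given.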
Keywords/Search Tags: Clustering analysis, soft clustering, soft-voting, semi-supervised soft-voting