Research On Clustering Ensemble Methods And Their Applications

Posted on: 2012-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: S Chen
Full Text: PDF
GTID: 2218330368483548
Subject: Computer application technology
Abstract/Summary:
Clustering analysis is an important research area in data mining and has been widely applied in statistics, biology and marketing. Hundreds of clustering algorithms have been proposed recently. However, conventional clustering algorithms often suffer from the curse of dimensionality and therefore perform poorly on high-dimensional data. Soft subspace clustering is an effective means of processing high-dimensional data, but most existing soft subspace clustering algorithms contain parameters that are difficult for users to set. In real-world applications, it is also difficult to find a single clustering algorithm that can handle clusters of all shapes and sizes, or to decide which clustering algorithm should be used for a particular data set. Many scholars have therefore begun to study clustering ensemble methods. Clustering ensembles can outperform a single clustering algorithm in terms of robustness, novelty, stability, parallelization and scalability.

This thesis first reviews subspace clustering, clustering ensembles, semi-supervised learning and imbalanced data classification. To overcome the limitations of traditional subspace clustering algorithms, we then propose a new soft subspace clustering algorithm named SC-IFWSA. By using an improved feature weight self-adjustment mechanism (IFWSA), it does not require users to set any parameter values, and it adaptively updates the weights of all dimensions for each cluster according to their adjustment margins. Based on clustering ensembles, we further propose two new methods that address the limitations of traditional semi-supervised classification and imbalanced data classification respectively, so as to improve classifier performance:

(1) A new semi-supervised classification algorithm based on clustering ensembles, named SSCCE, which uses an easily understandable method for estimating labeling confidence. It first generates multiple partitions of the given data and matches the clusters across the different partitions. The unlabeled training samples with a high clustering consistency index are then selected, labeled, and added to the labeled training set. Finally, a learner is trained on the enlarged labeled training set.

(2) A novel classification method for imbalanced data sets based on clustering ensembles, which aims to give classification methods a better training set. It uses the clustering consistency index to find the minority examples on cluster boundaries and the majority examples at cluster centers, and then applies an improved synthetic minority over-sampling technique (SMOTE) and a modified random under-sampling method respectively to deal with the imbalance.

Experimental results show that the three proposed methods are effective and feasible, and can perform better on most data sets.
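The selection step described for SSCCE (generate several partitions, match their clusters, keep the samples on which the partitions agree) can be illustrated with a minimal sketch. The abstract does not define the clustering consistency index, so the definition used here (the fraction of aligned partitions that agree with a sample's majority cluster) is an assumption, as are the function names `align_to_reference`, `consistency_index` and `select_confident`, the greedy overlap matching, and the 0.8 threshold. It is not the thesis's implementation.

```python
from collections import Counter


def align_to_reference(reference, partition):
    """Relabel `partition` so its cluster ids match `reference` where overlap is largest.

    Assumption: greedy one-to-one matching on overlap counts. An optimal
    assignment (e.g. the Hungarian method) could be substituted.
    """
    overlap = Counter(zip(partition, reference))
    mapping, used = {}, set()
    # Visit (cluster, reference cluster) pairs from largest overlap to smallest;
    # ties are broken by the string form of the pair to keep the result deterministic.
    for (src, dst), _ in sorted(overlap.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        if src not in mapping and dst not in used:
            mapping[src] = dst
            used.add(dst)
    # Clusters that found no partner keep a distinct placeholder label.
    return [mapping.get(c, ("unmatched", c)) for c in partition]


def consistency_index(partitions):
    """Return (scores, majority labels): per-sample agreement across aligned partitions."""
    reference = partitions[0]
    aligned = [list(reference)] + [align_to_reference(reference, p) for p in partitions[1:]]
    scores, majority = [], []
    for labels in zip(*aligned):
        label, votes = Counter(labels).most_common(1)[0]
        majority.append(label)
        scores.append(votes / len(aligned))
    return scores, majority


def select_confident(scores, unlabeled_idx, threshold=0.8):
    """Keep the unlabeled sample indices whose consistency reaches the (assumed) threshold."""
    return [i for i in unlabeled_idx if scores[i] >= threshold]


if __name__ == "__main__":
    partitions = [
        [0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],  # same grouping with permuted cluster ids
        [0, 0, 1, 1, 1, 1],  # sample 2 is placed in the other cluster
    ]
    scores, majority = consistency_index(partitions)
    print(select_confident(scores, range(6)))  # prints [0, 1, 3, 4, 5]
```

In this toy input, sample 2 is the only one on which the three partitions disagree, so it is the only one left out of the confident set. How the selected samples are then labeled and passed to the learner is not specified in the abstract and is left out of the sketch.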
Keywords/Search Tags: Clustering Ensembles, Classification, High Dimensional Data, Subspace Clustering, Semi-Supervised Learning, Unbalanced Data Sets