
Research On Key Technologies Of Clustering Ensemble

Posted on: 2008-06-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H L Luo
GTID: 1118360242472943
Subject: Computer Science and Technology
Abstract/Summary:
Ensemble learning combines multiple learned models to solve a problem. Since the mid-1990s, ensemble learning has gradually become the most popular research direction in machine learning. Early work on ensemble learning focused on supervised learning; only in the last five years has there been substantial activity in clustering ensemble research. This dissertation investigates the key technologies of clustering ensembles, and its main contributions are summarized as follows:

Firstly, a new clustering approach named COHMMOP is proposed, based on mathematical morphology operations. COHMMOP detects clusters as well-separated subsets, called clustering cores, by means of hierarchical mathematical morphology procedures. Based on COHMMOP, a clustering ensemble algorithm named CEOMM is proposed, which combines multiple clustering cores found with different structuring elements to obtain a desirable and correct clustering core of the data set. CEOMM then derives the clustering of the data set from this ensemble clustering core. Experimental results demonstrate that both COHMMOP and CEOMM cluster data with complex cluster shapes better than classical clustering algorithms, and that they can also find an optimal number of clusters. Moreover, because it uses different structuring elements, CEOMM can discover overlapping clusters with different arbitrary shapes.

Secondly, four diversity measures for clustering ensembles are proposed. Six experiments are designed to examine the relationships between the accuracy of clustering ensembles and the diversity measures under different ensemble methods, different ensemble sizes, and different data distributions. The experiments show that the relationships between these diversity measures and ensemble performance are not monotonic.
However, when ensembles of moderate size are constructed by suitable clustering algorithms for a data set with a uniform cluster distribution, the correlation coefficients between the diversity measures and ensemble performance are relatively high. Finally, some suggestions on the use of diversity measures in building clustering ensembles are given.

Thirdly, a method named CEAN, focused on generating clustering ensembles with high diversity, is proposed. By introducing artificial data, CEAN obtains clustering ensembles with high diversity. Based on CEAN, an improved method for constructing diverse ensembles, named ICEAN, is proposed. ICEAN selects diverse clusterings from a large clustering ensemble produced by CEAN to obtain a smaller and more diverse ensemble. The experimental results show that both CEAN and ICEAN obtain ensembles with higher diversity than other popular approaches to constructing clustering ensembles, and that ICEAN always obtains the most diverse ensembles. Consequently, for the same average accuracy of ensemble members, CEAN and ICEAN achieve a better combined clustering result.

Fourthly, a consensus scheme named CMCUGA, which uses a genetic algorithm based on information theory, is proposed. A combined clustering is found by minimizing an information-theoretic criterion function with the genetic algorithm. Experimental results demonstrate the effectiveness of the proposed method. Additionally, a consensus scheme based on categorical data clustering is proposed: a combined partition is found as the solution to the corresponding categorical data clustering problem using the k-modes and LIMBO algorithms. Experimental results demonstrate the effectiveness of this scheme as well.

Fifthly, a clustering algorithm named CBEST, based on ensemble and spectral techniques, is presented; it works well for data with mixed numeric and categorical features.
A similarity measure based on a clustering ensemble is adopted to define the similarity between pairs of objects; it makes no assumptions about the underlying distributions of the feature values. A spectral clustering algorithm is then applied to the similarity matrix to extract a partition of the data. The performance of CBEST has been studied on artificial and real data sets. The results demonstrate the effectiveness of the algorithm on mixed-data clustering tasks and its robustness to noise. Comparisons with other related clustering schemes illustrate the superior performance of this approach. Moreover, CBEST can incorporate prior knowledge effectively.
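
The abstract does not give the exact similarity measure used by CBEST. As an illustration only, a common ensemble-based similarity is the co-association matrix, in which the similarity of two objects is the fraction of ensemble members that place them in the same cluster. The sketch below assumes this definition; the function name `co_association` and the toy partitions are hypothetical and are not taken from the dissertation.

```python
def co_association(partitions):
    """Build a co-association similarity matrix from a clustering ensemble.

    partitions: list of label sequences, one per ensemble member; each
    sequence assigns a cluster label to the same n objects.
    Returns an n x n list of lists whose (i, j) entry is the fraction of
    partitions in which objects i and j share a cluster label.
    NOTE: this is an assumed, generic definition, not necessarily the
    measure used by CBEST.
    """
    if not partitions:
        raise ValueError("need at least one partition")
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same objects")
        for i in range(n):
            for j in range(n):
                # Only label equality within one partition matters, so the
                # measure is unaffected by how each member names its clusters.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


# Toy ensemble: three clusterings of four objects (hypothetical data).
ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as the first, with labels swapped
    [0, 0, 0, 1],
]
similarity = co_association(ensemble)
```

Because the matrix depends only on whether two objects share a label, it can be computed from clusterings of numeric, categorical, or mixed features alike. A spectral clustering algorithm that accepts a precomputed affinity matrix could then be run on `similarity` to extract a final partition.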
Keywords/Search Tags: clustering, ensemble learning, clustering ensemble, diversity, measure, mathematical morphology, mixed data clustering, categorical data clustering, artificial data