Cluster ensemble integrates multiple partitions of a dataset into a new clustering that reflects the cluster structure information of all the base clusters to the greatest extent. The quality of the base clusters is therefore crucial to the final ensemble result. K-means is one of the most widely used algorithms for producing base partitions: it is easy to implement, its computational cost is low, and its clustering mechanism conforms to the machine learning assumption that the class-conditional probability of local data is constant. However, K-means usually adopts Euclidean distance as its distance measure, so it can only find spherical clusters. It therefore cannot generate high-quality base clusters on datasets with complex structures, especially those whose class structures are not spherically distributed but are based on connectivity. This paper therefore presents an optimization method for base clusters: the homogeneity of each cluster generated by K-means is assessed, and clusters with poor homogeneity are partitioned again to improve their homogeneity. As a result, the quality of the entire cluster ensemble is improved. Experiments on 8 datasets demonstrate the effectiveness of the proposed method. This paper also presents a clustering ensemble based on a refined association matrix, which yields a more stable and accurate final clustering. The scheme is composed of two layers. In the first layer, K-means is applied to the dataset multiple times to obtain a number of base clusters, and a refined association matrix is generated by integrating each base cluster independently and iteratively. Compared with the traditional association matrix, this matrix better reflects the internal structure information of the data. In the second layer, the refined association matrix is used to compute intra-class homogeneity and inter-class comparability, which guide the partitioning and merging of the base clusters and generate the final clustering. Experimental results on 8 synthetic and real (UCI) datasets show that the proposed approach is effective.
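
For orientation, the traditional association matrix that the refined matrix is compared against can be sketched as follows. This is a minimal illustration of the standard co-association construction, not the refined matrix proposed in the paper, whose iterative integration step is not specified in this abstract. The function name `co_association` and the example label lists are assumptions made for illustration; in practice the label lists would come from repeated K-means runs.

```python
from itertools import combinations


def co_association(partitions):
    """Return the n x n matrix whose (i, j) entry is the fraction of
    base partitions that place points i and j in the same cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same points")
        # Every point always co-occurs with itself.
        for i in range(n):
            counts[i][i] += 1
        # Count each unordered pair once and mirror it to keep symmetry.
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


# Three hypothetical base partitions of five points (for example, from
# K-means runs with different initialisations). Label values are
# arbitrary within each run; only co-membership matters.
base_partitions = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
A = co_association(base_partitions)
```

Entries close to 1 mark pairs of points that the base partitions consistently group together, and entries close to 0 mark pairs that are consistently separated. A consensus step would then operate on a matrix of this kind.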