| With the rapid development of Internet technology and an ever-changing society, people have easy access to all kinds of data. Because such data are extensive, complex, and high-dimensional, and are accompanied by coarseness, ambiguity, and uncertainty, finding useful information in them has become increasingly difficult. Clustering has received much attention from experts and scholars because it analyzes sample data without any a priori knowledge and identifies the underlying structure and characteristics of the data solely from the numerical relationships between samples. However, a single clustering algorithm can often solve only one or a few kinds of clustering problems and does not generalize across the variety of complex clustering problems. For example, real multidimensional data sets may contain multiple shapes or structures, and a single clustering algorithm cannot distinguish their underlying structural distribution. The clustering ensemble has therefore become a popular data mining method in recent years for discovering the latent information in sample data sets. A clustering ensemble algorithm aims to obtain a better partition into clusters by fusing multiple base clusterings. A set of base clusterings is first generated by running different single clustering algorithms, or the same algorithm with different parameters, multiple times; a consensus function is then applied to obtain a clustering result with higher accuracy and better robustness than any single clustering algorithm. However, current research on clustering ensembles focuses mainly on optimizing the consensus function, and the measurement and screening of base clustering quality have received less attention. Moreover, in studies of three-way clustering ensembles, the thresholds used for base clustering screening are mostly set manually and lack a reasonable semantic interpretation. Therefore, this paper 
conducts an in-depth study of the quality measurement and screening of base clusterings, based on three-way decision theory and combining sample similarity, information entropy, and clustering ensembles. (1) To address the problem that the threshold must be specified manually in a three-way clustering ensemble, this paper proposes an automatic three-way screening method for base clusterings based on sample similarity (ATWSAS). First, multiple sets of diverse base clusterings are generated using different single clustering algorithms and are label-matched. A base clustering quality measure based on sample similarity is then proposed. Next, a concept of three-way decision domain quality is defined on top of this quality measure, and an algorithm that automatically selects the best threshold for the three-way decision is built on the domain quality concept. The base clusterings selected by the algorithm take part in the final clustering ensemble. Clustering ensemble algorithms optimized with ATWSAS are compared with traditional clustering ensemble algorithms in experiments on several data sets, which show the effectiveness of the proposed algorithm. (2) Because the generated base clusterings are labeled inconsistently in a three-way clustering ensemble, label-matching methods are usually applied to align the labels sample by sample before the base clusterings are screened. Label matching can easily lose useful information, which affects the performance of the clustering ensemble. Meanwhile, when deciding whether a base clustering takes part in the final clustering ensemble, only its quality is considered and the diversity among base clusterings is ignored, even though this diversity often influences the result of the final clustering ensemble. To this end, an automatic three-way screening method (ATWSAAE) based on 
attribute and information entropy for base clusterings is proposed in this paper. First, multiple sets of diverse base clusterings are generated using different single clustering algorithms. A base clustering quality measure based on attributes and information entropy is then proposed. Next, a new concept of three-way decision domain quality is defined on top of this quality measure, and a new algorithm for automatically selecting the optimal threshold of the three-way decision is proposed. Finally, clustering ensemble algorithms optimized with ATWSAAE are compared experimentally with those optimized with ATWSAS to show the effectiveness of the proposed algorithm. Both optimization methods provide algorithms for automatically selecting the optimal threshold of the three-way decision from the perspective of base clustering quality measurement and screening. Experiments show that the optimization algorithms proposed in this paper all achieve good results. |
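
The screening-then-ensemble pipeline described above can be illustrated with a minimal sketch. It is not the ATWSAS or ATWSAAE algorithm: the quality scores are made-up numbers, the threshold pair `alpha` and `beta` is fixed by hand rather than selected automatically, and a co-association matrix stands in for the consensus function because it needs no label matching. The names `three_way_screen` and `co_association` are hypothetical helpers introduced only for this illustration.

```python
from typing import Dict, List, Sequence


def three_way_screen(quality: Sequence[float], alpha: float, beta: float) -> Dict[str, List[int]]:
    """Assign each base clustering to the accept, defer, or reject region by quality."""
    if not beta < alpha:
        raise ValueError("expected beta < alpha")
    regions: Dict[str, List[int]] = {"accept": [], "defer": [], "reject": []}
    for idx, q in enumerate(quality):
        if q >= alpha:
            regions["accept"].append(idx)   # positive region: joins the ensemble
        elif q <= beta:
            regions["reject"].append(idx)   # negative region: discarded
        else:
            regions["defer"].append(idx)    # boundary region: decision postponed
    return regions


def co_association(partitions: Sequence[Sequence[int]]) -> List[List[float]]:
    """Fraction of partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
            for i in range(n)]


# Toy data: four base clusterings of five samples.
base_clusterings = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],  # same partition as the first, with permuted labels
    [0, 1, 0, 1, 0],  # a poor partition
    [0, 0, 0, 1, 1],
]
# Hypothetical quality scores (assumed for illustration, not taken from the paper).
quality_scores = [0.9, 0.8, 0.2, 0.5]

regions = three_way_screen(quality_scores, alpha=0.7, beta=0.3)
selected = [base_clusterings[i] for i in regions["accept"]]
consensus = co_association(selected)
```

In this toy run the first two clusterings are accepted, the third is rejected, and the fourth falls into the boundary region. Because the co-association matrix only records whether two samples share a cluster, the permuted labels of the second clustering do not affect the result.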