Faced with massive, complex, and diverse data, how to mine valuable information from big data and analyze it has become a research hotspot in the era of artificial intelligence. Clustering analysis is one of the important research topics in data mining. Clustering divides data without category labels into several categories according to certain criteria, so that the similarity among samples in the same category is maximized while the similarity between samples in different categories is minimized. Traditional clustering algorithms are hard clustering, that is, two-way clustering. This type of clustering requires clear boundaries between categories, but in practice the available information is often insufficient. If a data object is then forced into a single cluster, the probability of misclassification increases and clustering accuracy decreases. In response to these shortcomings of traditional clustering methods, Yu Hong et al. applied the theory of three-way decision to clustering and proposed three-way clustering. Compared with two-way clustering algorithms, three-way clustering algorithms introduce the concept of a boundary region, which can effectively address the inaccurate partitioning caused by incomplete information or insufficient data in traditional two-way clustering. This article introduces a method for measuring sample similarity and applies it to clustering ensembles and three-way clustering. The main research content is as follows:

(1) An Adaptive Three-way Clustering Ensemble Algorithm Based on Sample Similarity. First, the algorithm generates the basic clustering results. Unlike existing clustering ensemble algorithms, it uses partial features of the samples to obtain them. Then, a cluster label matching algorithm is used to match the basic clustering set and obtain a
set of basic clustering results, which are used to construct the sample similarity. Next, based on the sample similarity, an equivalence relation is defined to calculate the core region and boundary region of sample subsets, from which roughness is defined. Based on roughness, a partition validity index is then defined to measure clustering performance under different partition thresholds, and a threshold adaptive selection algorithm is proposed. In the basic clustering ensemble stage, majority voting is used to integrate the basic clustering results into a preliminary clustering result, and the threshold adaptive selection algorithm is then applied to each cluster of that result. Using the resulting set of optimal thresholds, all clusters are partitioned to obtain the final sets of core regions and boundary regions. Comparative experiments on UCI datasets verify the effectiveness of the algorithm.

(2) An Adaptive Three-way Clustering Algorithm Based on Neighborhood Sample Similarity. Drawing on the idea of the K-nearest neighbor algorithm, a neighborhood sample similarity is constructed. Then, intra-cluster similarity and inter-cluster similarity are defined, which reflect the tightness of the sample distribution within each cluster and the difference in sample distribution between clusters, respectively. Next, by integrating the intra-cluster and inter-cluster similarity, a clustering effectiveness index is defined to measure the relationship between clustering performance and the number of clusters. A cluster number adaptive selection algorithm is proposed to obtain the optimal number of clusters automatically. Finally, an adaptive three-way clustering method based on neighborhood sample similarity is proposed by integrating the clustering effectiveness index and the partition validity index. The final three-way clustering results are obtained from the optimal number of clusters and the optimal threshold. The
rationality of the algorithm is verified through multiple sets of comparative experiments.
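
To make the ensemble stage in (1) more concrete, the following is a minimal Python sketch of majority voting followed by a core/boundary split. It is an illustration under stated assumptions and not the algorithm proposed here: the function names `majority_vote` and `three_way_split` are hypothetical, the base clusterings are assumed to be label-matched already, the "support" of a sample (the fraction of base clusterings that agree with its consensus label) is a simplified stand-in for the sample similarity, and the threshold is passed in by hand instead of being chosen by the roughness-based partition validity index.

```python
import numpy as np

def majority_vote(labels):
    """Combine label-matched base clusterings by majority voting.

    labels: (m, n) integer array, m base clusterings of n samples,
            with cluster ids 0..k-1 already aligned across clusterings.
    Returns the consensus label and the vote share (support) per sample.
    """
    labels = np.asarray(labels)
    k = labels.max() + 1
    # counts[c, j] = number of base clusterings that put sample j in cluster c
    counts = np.stack([(labels == c).sum(axis=0) for c in range(k)])
    consensus = counts.argmax(axis=0)
    support = counts.max(axis=0) / labels.shape[0]
    return consensus, support

def three_way_split(consensus, support, threshold):
    """Split each consensus cluster into a core and a boundary region.

    Assumption for illustration only: a sample is in the core region when
    its support reaches the threshold, otherwise in the boundary region.
    """
    regions = {}
    for c in range(int(consensus.max()) + 1):
        in_cluster = consensus == c
        core = np.flatnonzero(in_cluster & (support >= threshold)).tolist()
        boundary = np.flatnonzero(in_cluster & (support < threshold)).tolist()
        regions[c] = (core, boundary)
    return regions
```

For example, with three base clusterings `[[0, 0, 1, 1, 1], [0, 0, 1, 1, 0], [0, 1, 1, 1, 0]]` and a threshold of 0.8, samples on which all base clusterings agree fall into core regions, while samples with a 2-to-1 vote fall into the boundary region of their consensus cluster.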