Three-way Clustering Analysis For Incomplete Data

Posted on:2021-02-20

Degree:Master

Type:Thesis

Country:China

Candidate:H Shi

Full Text:PDF

GTID:2428330611997568

Subject:Computer Science and Technology

Abstract/Summary:

Cluster analysis, or clustering, is an unsupervised data mining technique that groups similar objects into the same cluster according to a chosen similarity measure. It helps identify patterns among elements, reveals associations between objects, and discovers hidden data structures. Because of these advantages, clustering has been widely applied in fields such as machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

However, traditional clustering algorithms are hard clustering algorithms: each object belongs to only one cluster, and clusters do not intersect. To better represent the structure of the data, soft clustering algorithms such as rough k-means and rough-fuzzy k-means have been proposed. Three-way clustering is a special soft clustering method that incorporates three-way decision. Each cluster is composed of a core region and a boundary (fringe) region. Three-way clustering explicitly accounts for objects whose cluster cannot be determined, which can improve clustering accuracy to a certain extent and effectively reduce decision-making risk.

In practice, some values are lost because of data acquisition difficulties, random noise, data loss, data misreading, and so on. In the UCI repository, which is commonly used in machine learning, more than 40% of the data sets contain missing values; such data sets are called incomplete data sets. Currently, most clustering algorithms can only process complete data sets and cannot process incomplete ones. Therefore, this paper studies not only the clustering of incomplete data sets but also the three-way clustering of complete data sets. The main work of the paper includes the following aspects.

(1) An improved mean imputation clustering method for incomplete data based on the k-means algorithm (KM-IMI) is proposed. First, values are randomly removed according to a specified missing rate, and the data set is divided into two disjoint sets: objects with missing values and objects without missing values. Second, the objects without missing values are clustered by a traditional clustering algorithm. For each object with missing values, the missing attribute values are filled with the attribute means of each cluster in turn, based on the clustering result of the objects without missing values. Perturbation analysis of the cluster centroids is applied to search for the optimal imputation. The clustering results on several UCI data sets are evaluated with several validity indexes, which demonstrate the effectiveness of the proposed algorithm.

(2) Inspired by the KM-IMI algorithm, a voting-based three-way ensemble clustering method for incomplete data is proposed. The data set is processed, and multiple base clustering results are obtained through ensemble clustering. The cluster labels of the base clustering results are matched, the intersection of the clusters with the same label is found, and the objects in the intersection are assigned to the core region of the corresponding cluster. The votes of the remaining objects are counted to determine whether each object belongs to the core region or the boundary region of the corresponding cluster. Finally, the three-way clustering result of the imputed incomplete data set is obtained.

(3) A three-way clustering model based on three-way decision, called TWKM, is proposed. In the TWKM model, an overlapping clustering is used to obtain the supports of the clusters (the unions of the core regions and the fringe regions), and perturbation analysis is applied to separate the core regions from the supports. The difference between the support and the core region is regarded as the fringe region of the cluster. A three-way interpretation of each cluster is therefore formed naturally. In addition, spectral clustering is applied to the TWKM model to form a three-way clustering algorithm named TWSC. The Davies-Bouldin index (DB), the Average Silhouette index (AS), and Accuracy (ACC) are computed on the core regions to evaluate the structure of the three-way clustering result. The experimental results show that this strategy is effective in improving the structure of clustering results.
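The voting step of the three-way ensemble method can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the base clusterings have already been label-matched, and it uses a hypothetical vote-share threshold (`core_share`) to decide between core and boundary, since the abstract does not state the exact voting rule. The function name `three_way_vote` and the toy labelings are also illustrative only.

```python
from collections import Counter


def three_way_vote(labelings, core_share=0.75):
    """Split objects into core and boundary regions by voting.

    labelings: list of label lists, one per base clustering, with labels
        already matched across clusterings (assumed done beforehand).
    core_share: hypothetical threshold; an object joins the core region of
        its most-voted cluster when that cluster's vote share reaches it.
    """
    n_runs = len(labelings)
    n_objects = len(labelings[0])
    clusters = sorted({label for run in labelings for label in run})
    core = {c: set() for c in clusters}
    boundary = {c: set() for c in clusters}

    for i in range(n_objects):
        votes = Counter(run[i] for run in labelings)
        top_label, top_count = votes.most_common(1)[0]
        if top_count / n_runs >= core_share:
            # Unanimous objects (the intersection) always land here.
            core[top_label].add(i)
        else:
            # Uncertain object: boundary of every cluster that got a vote.
            for label in votes:
                boundary[label].add(i)
    return core, boundary


# Four matched base clusterings of five objects (toy data).
labelings = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 1],
]
core, boundary = three_way_vote(labelings)
print(core)
print(boundary)
```

In this toy run, objects 0, 1, and 2 are assigned unanimously, object 3 receives three of four votes for cluster 1 and enters its core region, and object 4 is split evenly, so it is placed in the boundary regions of both clusters. With a threshold above 0.5, a tied object can never enter a core region.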

Keywords/Search Tags:

cluster analysis, incomplete information system, three-way decision, three-way clustering, cluster validity index

Related items

1	Research On New Cluster Validity Index For Overlapping Datasets In Cluster Analysis
2	Research Of New Clustering Validity Index In Cluster Analysis
3	Research On The New Validity Index Of Internal Clustering And The Method To Determine The Optimal Cluster Number
4	Research Of Improved K-means Algorithm And New Cluster Validity Index In Cluster Analysis
5	The Research And Comparative Analysis Of Cluster Validity Index
6	Research On Connectivity-based Cluster Validity
7	Research On New Clustering Validity Index Based On Improved Clustering Algorithm
8	Research Of Fuzzy Clustering Algorithm And Cluster Validity Index
9	Research On Effective Internal Index Framework For Cluster Evaluation
10	Research And Application On Determining Optimal Number Of Clusters In Cluster Analysis