Research On Clustering Algorithm For Mixed Datasets

Posted on: 2020-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Jiang
GTID: 2428330575996964
Subject: Computer application technology
Abstract/Summary:
Many mixed datasets with both numerical and categorical attributes have been collected in various fields, including medicine and biology. Because numerical and categorical attributes have different characteristics, the two types of data need to be treated differently when clustering unlabeled data, and designing an appropriate similarity measure plays an important role in clustering such datasets. To handle the two types of data uniformly, this thesis proposes two clustering algorithms for mixed datasets, each from a different perspective.

1) A clustering algorithm based on simplex vector mapping. First, the categorical attribute values are mapped to numerical vectors based on simplex theory, and it is proved that the vectors of any two values of the same attribute are separated by the same Euclidean distance. The categorical data are thus converted into numerical data. Second, the converted, purely numerical data are clustered within the K-Means framework. Finally, extensive experiments on four categorical datasets from UCI show that the algorithm outperforms Ng's K-modes algorithm, Cao's K-modes algorithm and the traditional vector mapping clustering algorithm by 1.72%, 2.74% and 1.86%, respectively, in terms of average accuracy. In addition, experiments on four mixed datasets show that the average clustering accuracy of the proposed algorithm is 2.68% and 2.22% higher than that of the traditional vector mapping clustering algorithm and the K-Prototype algorithm, respectively.

2) A clustering algorithm based on entropy weighting. First, the numerical data are transformed into categorical data by an automatic categorization technique. Second, the amount of information contained in each attribute is calculated based on information entropy theory, and a weighting method for the categorical attributes is designed to formulate the similarity measure. Finally, the proposed similarity measure is applied to the discretized data. Experiments on six UCI mixed datasets show that the clustering accuracy of the entropy-weighting algorithm is 2.13% and 4.28% higher than that of the OCIL and K-Prototype methods, respectively. Compared with the K-Means algorithm on the six numerical datasets, the algorithm improves the average clustering accuracy by 6.09%.

Both clustering algorithms proposed in this thesis can handle mixed datasets that contain numerical and categorical attributes. In particular, the clustering algorithm based on entropy weighting verifies that each attribute contains a different amount of information and has a different effect on the clustering result. This can provide some guidance for designing clustering methods for mixed datasets.
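
The two ingredients named in the abstract can be illustrated with a short Python sketch. It is not the thesis's implementation: the abstract does not give the exact simplex construction or the weighting formula, so the function names, the unit-distance scaling, and the "weight proportional to entropy" rule below are all assumptions made for illustration. The first function places the k values of one categorical attribute at the vertices of a regular simplex in k-1 dimensions, so that any two values are equally far apart; the other two compute the Shannon entropy of an attribute and turn the entropies into normalized weights.

```python
import math
from collections import Counter

import numpy as np


def simplex_vectors(k):
    """Return k points in R^(k-1) with all pairwise Euclidean distances equal to 1.

    Assumed construction: center the k one-hot vectors, then express them in
    the (k-1)-dimensional subspace they span.
    """
    centered = np.eye(k) - 1.0 / k        # centered one-hot rows, rank k-1
    u, s, _ = np.linalg.svd(centered)     # singular values sorted descending
    coords = u[:, : k - 1] * s[: k - 1]   # drop the direction with singular value 0
    return coords / math.sqrt(2.0)        # one-hot rows are sqrt(2) apart


def attribute_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of one attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_weights(columns):
    """Hypothetical weighting: each attribute's share of the total entropy."""
    h = np.array([attribute_entropy(col) for col in columns])
    total = h.sum()
    if total == 0:                        # every attribute is constant
        return np.full(len(columns), 1.0 / len(columns))
    return h / total


if __name__ == "__main__":
    vecs = simplex_vectors(4)
    dists = [np.linalg.norm(vecs[i] - vecs[j])
             for i in range(4) for j in range(i + 1, 4)]
    print(np.round(dists, 6))             # all six distances coincide
    print(entropy_weights([["a", "a", "b", "b"], ["x", "x", "x", "x"]]))
```

Once each categorical value has been replaced by its simplex vector, the rows can be concatenated with the numerical attributes and passed to any K-Means implementation, which is the general pipeline the abstract describes for the first algorithm. Whether a higher-entropy attribute should receive a larger or smaller weight is a design decision the abstract does not settle.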
Keywords/Search Tags: vector mapping, entropy-based weight, similarity measurement, mixed datasets, clustering analysis