Research On Clustering Algorithm For Mixed Datasets

Posted on:2020-03-16

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Jiang

Full Text:PDF

GTID:2428330575996964

Subject:Computer application technology

Abstract/Summary:

Many mixed datasets containing both numerical and categorical attributes have been collected in fields such as medicine and biology. Because numerical and categorical attributes have different characteristics, the two types of data must be treated differently when clustering unlabeled data, and designing an appropriate similarity measure plays an important role in clustering such datasets. To handle the two types of data uniformly, this thesis proposes two clustering algorithms for mixed datasets, each from a different perspective.

1) A clustering algorithm based on simplex vector mapping. First, categorical attribute values are mapped to numerical vectors based on simplex theory, and it is proved that the vectors of any two values of the same attribute are separated by the same Euclidean distance, so the categorical data are converted into numerical data. Second, the converted, purely numerical data are clustered within the K-Means algorithm framework. Finally, extensive experiments on four categorical datasets from UCI show that the algorithm improves average accuracy by 1.72%, 2.74% and 1.86% over Ng's K-modes algorithm, Cao's K-modes algorithm and the traditional vector mapping clustering algorithm, respectively. In addition, experiments on four mixed datasets show that its average clustering accuracy is 2.68% and 2.22% higher than that of the traditional vector mapping clustering algorithm and the K-Prototype algorithm, respectively.

2) A clustering algorithm based on entropy weighting. First, the numerical data are transformed into categorical data by an automatic categorization technique. Second, the amount of information contained in each attribute is calculated based on information entropy theory, and a weighting method for the categorical attributes is designed to formulate the similarity measure. Finally, the proposed similarity measure is applied to the discretized data. Experiments on six UCI mixed datasets show that the algorithm achieves higher clustering accuracy than the OCIL and K-Prototype methods, with improvements of 2.13% and 4.28%, respectively. Compared with the K-Means algorithm on six numerical datasets, it improves average clustering accuracy by 6.09%.

Both clustering algorithms proposed in this thesis can handle mixed datasets that contain numerical and categorical attributes. In particular, the algorithm based on entropy weighting confirms that attributes carry different amounts of information and therefore affect the clustering result differently. This can provide some guidance for designing clustering methods for mixed datasets.
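
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the exact simplex construction or the weighting formula, so both are assumptions here. The function names `simplex_vertices`, `encode_column` and `entropy_weights` are hypothetical. The simplex vertices are obtained by centering one-hot vectors and projecting them into k-1 dimensions, which is one standard way to get k points with equal pairwise distances. The weights are assumed to be each attribute's Shannon entropy normalized so that they sum to 1.

```python
import math
from collections import Counter

import numpy as np


def simplex_vertices(k):
    """Return k points in R^(k-1) whose pairwise Euclidean distances are all 1."""
    # Assumed construction: shift one-hot rows to their centroid, then
    # express them in an orthonormal basis of the (k-1)-dim subspace they span.
    centered = np.eye(k) - 1.0 / k
    _, _, vt = np.linalg.svd(centered)
    # One-hot rows are sqrt(2) apart, so dividing by sqrt(2) gives unit distances.
    return centered @ vt[: k - 1].T / math.sqrt(2.0)


def encode_column(values):
    """Map each distinct category in a column to one simplex vertex."""
    categories = sorted(set(values))
    vertices = simplex_vertices(len(categories))
    lookup = {c: vertices[i] for i, c in enumerate(categories)}
    return np.array([lookup[v] for v in values])


def entropy_weights(columns):
    """Weight each categorical column by its Shannon entropy (assumed: normalized to sum to 1)."""
    entropies = []
    for col in columns:
        n = len(col)
        # Shannon entropy in bits: sum of p * log2(1 / p) over observed values.
        h = sum((c / n) * math.log2(n / c) for c in Counter(col).values())
        entropies.append(h)
    total = sum(entropies)
    if total == 0:
        # Every column is constant; fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]
```

The output of `encode_column` for each categorical attribute could be concatenated with the numerical attributes and passed to any K-Means implementation. Under this assumed weighting, a constant attribute gets weight 0, which reflects the abstract's point that attributes carrying less information should influence the clustering result less.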

Keywords/Search Tags:

vector mapping, entropy-based weight, similarity measurement, mixed datasets, clustering analysis

Related items

1	Research Of Clustering Algorithms For Mixed Data Based On Attribute Weighting And Similarity Measuring
2	Research And Application Of Clustering Algorithm On The High Dimensional Datasets
3	Research On Similarity-Based Ontology Mapping Approach
4	Improved Affinity Propagation Clustering Algorithms And Their Applications
5	Research On Directional Clustering And Its Applications
6	Research On UBI Rate Determination Method Based On Entropy Weight-Topsis And Clustering
7	Research On Clustering Algorithm For Mixed Attributes And Application
8	Entropy-Based Clustering Algorithm For Module Detection In PPI Networks
9	Research On Text Similarity Algorithm Based On Vector Space Model
10	Support Vector Clustering Method And Its Applications To Biomedical Datasets