
Research On Clustering Based On Attribute Characteristics For Categorical And Binary Data

Posted on: 2020-08-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L W Fu
GTID: 1368330575973144
Subject: Management Science and Engineering
Abstract/Summary:
The development of information technology has produced a growing volume and variety of data in the management field. To handle these unlabeled data, managers urgently need unsupervised learning tools. Cluster analysis is an important branch of unsupervised learning and can significantly support management and decision making. While research on clustering numerical data has achieved substantial results, research on clustering categorical and binary data remains insufficient. Starting from the characteristics of binary and categorical attributes, this dissertation studies clustering algorithms for categorical data, the selection of internal validation indices, and internal validation indices for binary and categorical data, with the aim of offering a complete solution for cluster analysis. It consists of the following four parts.

(1) Internal evaluation of cluster validity is a key part of cluster analysis. Because each internal validity index has its own scope of application, choosing an index is an important issue in internal validity evaluation. We focus on methods for selecting internal indices, particularly by means of external indices, and analyze the biases and deficiencies that external validity indices introduce into the process of evaluating internal indices. By combining several external indices, this dissertation proposes a Strategy for selecting Internal clustering validity indices based on Dempster-Shafer evidence theory (SIDS). Experiments show that SIDS can combine the evaluation results of multiple external indices and select the best of several internal indices.

(2) To improve the evaluation capability of internal clustering validity indices for categorical data, we analyze the distribution of objects over categorical attributes and define the Strength of concentration Vector (SV) of a cluster. A dissimilarity based on the DisCRePancy of SVs (DCRP) is then defined. Based on the degree of concentration of the objects on categorical attributes, a similarity based on the CONCentration of attribute values (CONC) is proposed. On the basis of CONC and DCRP, we propose Clustering Validation based on Concentration of attribute values (CVC), together with a method for selecting its parameter.

(3) To improve the evaluation capability of internal clustering validity indices for binary data, we analyze the distribution of objects over binary attributes and define three types of binary attributes by introducing concentration criteria. Based on the differences between attribute types, we propose a dissimilarity of two clusters for Binary Data (DBD) and, building on it, a Clustering Validation index based on Type of Attributes altering for Binary data (CVTAB). Experimental results show that CVTAB performs well in evaluation. In addition, because binary and categorical attributes can be converted into each other, the effects of such conversion are analyzed and tested. Experiments show that CVTAB is better suited to binary data and CVC is better suited to categorical data.

(4) For clustering categorical data, the k-SV algorithm is proposed on the basis of SV, DCRP, and the k-modes algorithm framework. Experimental results show that k-SV performs well and stably in cluster analysis.

Finally, taking the recruitment process as an example, this dissertation verifies the effectiveness of the clustering algorithm and the internal validation indices based on the distribution of attribute values when applied in the field of management.
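The abstract does not give the details of SIDS, so the following is only a minimal sketch of the standard Dempster rule of combination from Dempster-Shafer evidence theory, which the strategy is said to build on. The candidate internal indices ("I1", "I2", "I3"), the two evidence sources, and all mass values are hypothetical and chosen purely for illustration; they are not taken from the dissertation.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass,
    and the masses of one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence supports the intersection of the two sets.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0 - 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalize by the non-conflicting mass.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Hypothetical frame of discernment: three candidate internal indices.
ALL = frozenset({"I1", "I2", "I3"})

# Hypothetical evidence derived from two external indices.
# Mass on ALL expresses that source's remaining uncertainty.
m_a = {frozenset({"I1"}): 0.6, frozenset({"I2"}): 0.1, ALL: 0.3}
m_b = {frozenset({"I1"}): 0.5, frozenset({"I3"}): 0.2, ALL: 0.3}

fused = combine(m_a, m_b)
best = max(fused, key=fused.get)  # hypothesis set with the largest fused mass
print(sorted(best), round(fused[best], 4))
```

In this toy setting both sources lean toward "I1", so the fused mass concentrates on it. How the dissertation turns external index scores into mass functions is not stated in the abstract and is left open here.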
Keywords/Search Tags: Clustering algorithm, Internal validity evaluation, Categorical data, Binary data