Research And Implementation Of Clustering Method For High Dimensional Categorical Data

Posted on: 2018-08-16  Degree: Master  Type: Thesis
Country: China  Candidate: J Tan
GTID: 2348330536452520  Subject: Software engineering
Abstract/Summary:
As an unsupervised machine learning method, cluster analysis divides unorganized data into a series of clusters according to certain rules, so that each cluster consists of highly similar data. This greatly simplifies subsequent data processing. Cluster analysis is widely used in fields such as network services, geography, biology, and business. However, as data sources multiply and data collection technology develops, both the size and the dimensionality of the data to be analyzed keep growing, and traditional clustering algorithms cannot achieve good results on such datasets. Soft subspace clustering, a research hotspot in high-dimensional data clustering, has therefore attracted increasing attention. Most existing soft subspace clustering algorithms, however, are based on the k-modes algorithm: both their similarity computation and their attribute weights depend on the chosen cluster centers (modes), so clustering quality depends on mode selection. In addition, existing soft subspace clustering algorithms do not distinguish between missing and complete data during clustering, which strongly affects the final result.

For high-dimensional, incomplete categorical data, this thesis proposes a new soft subspace clustering algorithm that combines the height-to-width ratio of the cluster histogram with the idea of soft subspace clustering. The algorithm builds on the efficient clustering algorithm CLOPE. First, attributes are weighted according to their average mutual information, and rough set theory is used to handle datasets with missing values. Then, because the clustering quality of CLOPE is affected by the order in which data are input, a "shuffle model" is proposed to eliminate the influence of input order on the final clustering quality. Finally, the algorithm is implemented in Scala on the Spark platform, so that it can be used for large-scale data clustering.

Real datasets from the UCI repository are used as experimental data, and four groups of experiments are set up to verify the effectiveness and scalability of the proposed algorithm. The results show that the clustering quality of the proposed algorithm (in the version without the rough set component) is better than that of CLOPE, and that the advantage of the missing-data handling method becomes more pronounced as the missing rate increases. Finally, compared with two other typical soft subspace clustering algorithms, the proposed algorithm has clear advantages in both clustering quality and running time.
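
To make the height-to-width criterion concrete, here is a minimal sketch of the profit function as it is commonly stated for the original CLOPE algorithm, not the weighted variant proposed in this thesis. It assumes each categorical record is treated as a set of (attribute index, value) items; the names `cluster_stats`, `profit`, and the repulsion parameter `r` are illustrative. Attribute weights, rough set handling of missing values, and the shuffle model are all left out.

```python
from collections import Counter


def cluster_stats(records):
    """Return (S, W, n) for one cluster of categorical records.

    S is the histogram area (total item occurrences), W is the histogram
    width (number of distinct items), n is the number of records.
    """
    hist = Counter()
    for rec in records:
        # Assumption: each (attribute index, value) pair counts as one item.
        hist.update(enumerate(rec))
    return sum(hist.values()), len(hist), len(records)


def profit(clusters, r=2.0):
    """Plain CLOPE profit: sum_i(S_i / W_i**r * n_i) / sum_i(n_i).

    With r = 2, S_i / W_i**2 is the height-to-width ratio of cluster i,
    because the histogram height is H_i = S_i / W_i.
    """
    weighted_sum = 0.0
    total_records = 0
    for records in clusters:
        if not records:
            continue  # empty clusters contribute nothing
        s, w, n = cluster_stats(records)
        weighted_sum += s / (w ** r) * n
        total_records += n
    return weighted_sum / total_records if total_records else 0.0


# Toy data: two attributes (colour, size), four records.
a = ("red", "small")
b = ("red", "small")
c = ("blue", "large")
d = ("blue", "large")

good = [[a, b], [c, d]]   # identical records grouped together
mixed = [[a, c], [b, d]]  # dissimilar records grouped together

print(profit(good), profit(mixed))
```

In this toy case the grouping that keeps identical records together yields tall, narrow histograms and a higher profit than the mixed grouping, which is the behaviour the height-to-width criterion is meant to reward.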
Keywords/Search Tags: categorical data, subspace clustering, soft subspace clustering, CLOPE, rough set, mutual information, Spark