Research Of Parallel Clustering Algorithm For Categorical Data

Posted on:2016-08-31

Degree:Master

Type:Thesis

Country:China

Candidate:T Guo

Full Text:PDF

GTID:2298330452466421

Subject:Software engineering

Abstract/Summary:

As an unsupervised learning method, clustering has applications in many fields. Cluster analysis of numerical data has been thoroughly studied, but methods designed for numerical data are not suitable for the categorical data that abound in practice. Categorical data consist of non-numerical attributes and call for different clustering models than numerical data. For large-scale, high-dimensional categorical data, cluster analysis is more challenging still. In this thesis, we study clustering, especially parallel clustering of categorical data, and implement a parallel clustering algorithm that handles large-scale categorical data effectively.

First, we survey clustering algorithms, especially those for categorical data, and then examine CLOPE, a representative clustering algorithm for categorical attributes. CLOPE defines a histogram-based global criterion function to judge the quality of a clustering, and it achieves good results on large, sparse, high-dimensional transaction databases. However, we find that the order of the data affects the clustering process, which makes it hard to find the globally optimal clusters stably. To address this problem, we propose the MRCLOPE algorithm, which iteratively divides the data into equal sections and permutes them. Each iteration of MRCLOPE divides the input dataset into p sections, arranges them into p! new datasets by permutation, clusters each dataset, and chooses the optimal clustering according to its profit as the input to the first step of the next iteration.

Second, because the same intermediate results recur during the computation of MRCLOPE, we put forward a result reuse strategy; reusing these intermediate results greatly improves the speed of clustering.

Third, to further reduce the running time, we put forward a distributed solution that implements MRCLOPE on Hadoop using the MapReduce parallel platform. The clustering tool has been released to the open source community (https://github.com/j2cms/mrclope).

Finally, we use three representative categorical datasets (e.g., the mushroom dataset) to test the MRCLOPE algorithm. Experiments show that MRCLOPE achieves better results than CLOPE. On the mushroom dataset, when CLOPE achieves its optimal result, MRCLOPE achieves a profit value 35.7% larger than that of CLOPE. When dealing with big data, parallel MRCLOPE greatly shortens the computing time compared with serial MRCLOPE and achieves a good speedup.

In this thesis, we propose the idea of dividing the data equally and then permuting it, which may serve as a reference for other algorithms affected by the order of the input data.
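
The divide-and-permute step described in the abstract can be illustrated with a small serial sketch. It is not the thesis implementation (the released tool runs on Hadoop); it only shows the idea under several assumptions. The profit function follows the criterion as commonly stated for CLOPE (sum of S/W^r * N over clusters, divided by the total number of transactions, with repulsion r). The helper names `profit`, `clope_pass`, `split_equal` and `mrclope_iteration` are hypothetical. The clustering is simplified to a single greedy pass, transactions are assumed to be non-empty, and the best ordering is assumed to be what feeds the next iteration.

```python
from collections import Counter
from itertools import permutations


def profit(clusters, r=2.0):
    """CLOPE-style profit: sum(S / W**r * N) / sum(N) over non-empty clusters.

    S = total item occurrences in a cluster, W = number of distinct items,
    N = number of transactions, r = repulsion (assumed default 2.0).
    """
    numerator, total = 0.0, 0
    for cluster in clusters:
        if not cluster:
            continue
        histogram = Counter(item for t in cluster for item in t)
        size = sum(histogram.values())
        width = len(histogram)
        numerator += size / width ** r * len(cluster)
        total += len(cluster)
    return numerator / total if total else 0.0


def clope_pass(transactions, r=2.0):
    """Simplified single greedy pass: each transaction joins the existing
    cluster (or a new one) that gives the highest profit. Order dependent."""
    clusters = []
    for t in transactions:
        best, best_score = None, float("-inf")
        for i in range(len(clusters) + 1):
            trial = [list(c) for c in clusters]
            if i == len(clusters):
                trial.append([t])      # open a new cluster
            else:
                trial[i].append(t)     # join existing cluster i
            score = profit(trial, r)
            if score > best_score:
                best, best_score = trial, score
        clusters = best
    return clusters


def split_equal(data, p):
    """Split data into p contiguous sections of near-equal size."""
    base, extra = divmod(len(data), p)
    sections, start = [], 0
    for i in range(p):
        end = start + base + (1 if i < extra else 0)
        sections.append(data[start:end])
        start = end
    return sections


def mrclope_iteration(data, p=3, r=2.0):
    """Try all p! orderings of the p sections and keep the most profitable."""
    best = (None, None, float("-inf"))
    for perm in permutations(split_equal(data, p)):
        ordered = [t for section in perm for t in section]
        clusters = clope_pass(ordered, r)
        score = profit(clusters, r)
        if score > best[2]:
            best = (ordered, clusters, score)
    return best


if __name__ == "__main__":
    data = [("a", "b"), ("x", "y"), ("a", "b", "c"),
            ("x", "y", "z"), ("a", "c"), ("x", "z")]
    order, clusters, score = mrclope_iteration(data, p=3)
    print(score, clusters)
```

Because the number of orderings grows as p!, a sketch like this is only practical for small p. That cost is consistent with the motivation the abstract gives for reusing intermediate results and for distributing the work with MapReduce.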

Keywords/Search Tags:

categorical data, CLOPE, MRCLOPE, parallel clustering, MapReduce

Related items

1	Research And Implementation Of Clustering Method For High Dimensional Categorical Data
2	Parallel Clustering Algorithm Based On MapReduce
3	A Study On Clustering Algorithms For Categorical Data With Applications
4	Research On The Clustering Algorithm Of Parallel Partition Based On MapReduce
5	Automatic categorical data clustering and spatial data clustering by consecutive resolution refinement
6	Parallel Text Clustering Based On MapReduce
7	The Research On Clustering Algorithm For Categorical Data Using Quantum Mechanics
8	Studies On Clustering Algorithms For Categorical Data
9	Study Of Algorithms For Clustering Categorical Data
10	Studies On Clustering Algorithms For Categorical Data