
Research Of Parallel Clustering Algorithm For Categorical Data

Posted on: 2016-08-31  Degree: Master  Type: Thesis
Country: China  Candidate: T Guo  Full Text: PDF
GTID: 2298330452466421  Subject: Software engineering
Abstract/Summary:
As an unsupervised learning method, clustering has applications in many fields. Cluster analysis of numerical data has been studied thoroughly, but methods designed for numerical data are not suitable for the categorical data that abound in practice. Categorical data are made up of non-numerical attributes and call for different clustering models than numerical data. For large-scale, high-dimensional categorical data, cluster analysis is even more challenging. In this thesis, we study clustering, especially parallel clustering of categorical data, and implement a parallel clustering algorithm that handles large-scale categorical data effectively.

Firstly, we survey clustering algorithms, especially those for categorical data, and then examine CLOPE, a representative clustering algorithm for categorical attributes. CLOPE defines a histogram-based global criterion function to judge the quality of a clustering, and it achieves good results when clustering large, sparse, high-dimensional transaction databases. However, we find that the order of the data affects the clustering process, so it is hard to find the globally optimal clusters in a stable way. To deal with this problem, we propose the MRCLOPE algorithm, which iteratively divides the data into equal parts and then permutes them. In each iteration, MRCLOPE divides the input dataset into p sections, arranges them into p! new datasets by permutation and combination, clusters each dataset, and chooses the optimal clustering according to the profit as the input to the first step of the next iteration.

Secondly, because the same intermediate results recur during the computation of MRCLOPE, we put forward a result reuse strategy. Reusing these intermediate results greatly improves the speed of clustering.

Thirdly, to further reduce the running time of the process, we put forward a distributed solution that implements MRCLOPE on Hadoop using the MapReduce parallel platform. The clustering tool has been released to the open source community (https://github.com/j2cms/mrclope).

Finally, we use three representative categorical datasets (e.g., the mushroom dataset) to test the MRCLOPE algorithm. Experiments show that MRCLOPE achieves better results than CLOPE. For the mushroom dataset, when CLOPE achieves its optimal result, MRCLOPE achieves a profit value 35.7% larger than that of CLOPE. When dealing with big data, parallel MRCLOPE greatly shortens the computing time compared with serial MRCLOPE and achieves a good speedup.

In this thesis, we propose the idea of dividing the data equally and then permuting it, which has reference significance for other algorithms affected by the order of the input data.
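
The divide-and-permute step described in the abstract can be illustrated with a minimal Python sketch. It is not the code from the mrclope repository, and it rests on several assumptions: the profit criterion is taken to be the one usually given for CLOPE (for each cluster, total item occurrences S divided by the number of distinct items W raised to a repulsion parameter r, weighted by cluster size), a single greedy pass stands in for the full CLOPE procedure, and only one iteration over the p! orderings is shown. The names `profit`, `greedy_cluster`, and `best_over_permutations`, and the defaults `r=2.0` and `p=3`, are hypothetical.

```python
from itertools import permutations


def profit(clusters, r=2.0):
    """Assumed CLOPE-style profit: sum(S / W**r * N) / sum(N) over clusters.

    S = total item occurrences, W = distinct items, N = transactions.
    """
    weighted = 0.0
    total = 0
    for cluster in clusters:
        if not cluster:
            continue
        size = sum(len(t) for t in cluster)      # S
        width = len(set().union(*cluster))       # W
        weighted += size / (width ** r) * len(cluster)
        total += len(cluster)
    return weighted / total if total else 0.0


def greedy_cluster(transactions, r=2.0):
    """Single order-dependent pass: put each transaction where profit is highest.

    Simplification: full CLOPE also refines the assignment in later passes.
    """
    clusters = []
    for t in transactions:
        best_idx, best_score = 0, -1.0
        # Try every existing cluster plus one new empty cluster.
        for i in range(len(clusters) + 1):
            trial = [list(c) for c in clusters] + [[]]
            trial[i].append(t)
            score = profit(trial, r)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx == len(clusters):
            clusters.append([t])
        else:
            clusters[best_idx].append(t)
    return clusters


def best_over_permutations(transactions, p=3, r=2.0):
    """Split into p near-equal sections, cluster all p! orderings, keep the best."""
    n = len(transactions)
    bounds = [n * k // p for k in range(p + 1)]
    sections = [transactions[bounds[k]:bounds[k + 1]] for k in range(p)]
    best_score, best_clusters = -1.0, []
    for order in permutations(sections):
        data = [t for section in order for t in section]
        clusters = greedy_cluster(data, r)
        score = profit(clusters, r)
        if score > best_score:
            best_score, best_clusters = score, clusters
    return best_score, best_clusters


if __name__ == "__main__":
    # Toy transactions; each is a set of categorical items.
    data = [frozenset(s) for s in ["ab", "abc", "de", "df", "ab", "def"]]
    score, clusters = best_over_permutations(data, p=3)
    print(score, clusters)
```

Because the original ordering is one of the p! permutations, the profit selected here is never lower than what the same greedy pass yields on the unpermuted data. The cost grows factorially with p, which is presumably why the thesis pairs this step with result reuse and a MapReduce implementation.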
Keywords/Search Tags:categorical data, CLOPE, MRCLOPE, parallel clustering, MapReduce