An Advanced Partition Clustering And Parallelization On Cluster Environment

Posted on: 2012-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Su
Full Text: PDF
GTID: 2178330335970427
Subject: Computer software and theory
Abstract/Summary:
Clustering analysis, an important task in data mining, has received extensive attention from both academia and industry. Because it is conceptually simple, time-efficient, and scales well to large data sets, partition clustering has been applied in many fields, such as image processing, customer analysis, biological genetics research, and text mining. The study of partition clustering algorithms therefore has both theoretical and practical value.

In this thesis, after studying the classical K-means algorithm, we discuss its characteristics and shortcomings and investigate three existing problems in particular: the quality of the clustering results, the efficiency of the algorithm, and the processing of large-scale data. We then propose an improved and effective solution.

The main research content and results are as follows. First, to address the dependence of K-means on its initial values, we put forward an effective strategy for selecting the initial cluster centers. Second, to address the impact of noise on the clustering results, we design a noise-removal method that prevents individual noise points from distorting the results, lets the criterion function converge reasonably, and reduces the number of iterations. The method thus improves both the clustering results and the execution efficiency of the clustering analysis. Third, combining the center selection with the noise-removal preprocessing, we propose an improved partition clustering algorithm named GK-means. Theoretical analysis shows that the computational efficiency of the improved algorithm depends on the data size and the number of grid cells. We also designed an experiment to illustrate the clustering results and the efficiency of the improved algorithm; the experimental results demonstrate that it has advantages over existing algorithms. Fourth, to adapt the improved algorithm to large-scale data, we studied its parallelization in a cluster environment. We designed parallel versions of both the initialization of cluster centers and the clustering analysis, and propose a novel algorithm named PGK-means, which greatly improves the performance of the clustering algorithm.
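The abstract does not give the details of GK-means, so the following is only a minimal sketch of the general idea it names: using a grid to discard noise and to pick initial centers before running ordinary K-means. Everything specific here is an assumption made for illustration, not the thesis's method: the function names, the `cell_size` and `min_pts` parameters, the rule that points in cells with fewer than `min_pts` points count as noise, and the choice of the `k` densest cells as seeds.

```python
import math
from collections import defaultdict


def grid_cells(points, cell_size):
    """Bucket 2-D points into square grid cells keyed by integer (col, row)."""
    cells = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y))
    return cells


def centroid(pts):
    """Arithmetic mean of a non-empty list of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


def grid_init(points, k, cell_size, min_pts):
    """Assumed preprocessing: treat points in sparse cells as noise,
    then seed centers from the centroids of the k densest cells."""
    cells = grid_cells(points, cell_size)
    dense = [pts for pts in cells.values() if len(pts) >= min_pts]
    kept = [p for pts in dense for p in pts]
    ranked = sorted(dense, key=len, reverse=True)
    # May return fewer than k centers if fewer than k cells are dense.
    centers = [centroid(pts) for pts in ranked[:k]]
    return kept, centers


def kmeans(points, centers, max_iter=100):
    """Standard K-means (Lloyd) iteration from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center when a group ends up empty.
        updated = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


# Toy data: two tight groups plus one isolated point acting as noise.
data = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4),
        (5.1, 5.2), (5.3, 5.4), (5.2, 5.1),
        (9.9, 0.1)]
kept, seeds = grid_init(data, k=2, cell_size=1.0, min_pts=2)
centers = kmeans(kept, seeds)
```

One limitation of this simplified seeding rule is that several of the densest cells can belong to the same cluster; a more careful scheme would merge adjacent dense cells before choosing seeds. Because each grid cell can be counted independently, the bucketing step is also the natural place to split work across nodes in a cluster environment, which is the direction the abstract describes for PGK-means.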
Keywords/Search Tags: partition clustering, cluster environment, clustering analysis, grid technique, K-means