Fuzzy C-means and K-means Clustering Algorithms and Their Parallelization

Posted on: 2014-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Liao
Full Text: PDF
GTID: 2248330395491772
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Clustering, an important research field in data mining, can be divided into two kinds of algorithms: soft clustering and hard clustering. Fuzzy c-means clustering is a classic soft clustering algorithm, while k-means clustering is a classic hard clustering algorithm. Both algorithms have been widely used in pattern recognition, image processing, medical research, and other areas. This paper focuses on the poor noise resistance of the fuzzy c-means and k-means clustering algorithms, whose objective functions easily fall into local extrema when noise data are selected as the initial cluster centers. The main points of our work can be summarized as follows:

(1) A fuzzy c-means clustering algorithm is introduced that uses fuzzy entropy to address the poor noise resistance of the standard fuzzy c-means clustering algorithm. First, the objective function is redefined by introducing the fuzzy entropy. Second, the new objective function is derived. Third, a new solution formula for the membership degree, which has the characteristics of a Gaussian distribution, is presented. As a result, the algorithm effectively avoids the influence of noise data on the cluster centers. Experimental results on UCI data confirm that the algorithm effectively improves the accuracy and noise resistance of fuzzy clustering.

(2) A k-means clustering algorithm based on the average and H weights is presented. It aims to solve the problems of a large number of iterations and a tendency to fall into local extrema, which arise because the k-means clustering algorithm easily selects noise data. First, according to mean theory, a new method of selecting the initial cluster centers is introduced, which effectively reduces the number of iterations and avoids the algorithm's tendency to fall into local extrema.
Second, according to the different degree of influence that each sample in the data set has on the clustering, the Euclidean distance and the iteration formula of the k-means clustering algorithm are redefined by introducing H weights. Finally, experimental results on UCI data demonstrate the accuracy and noise resistance of the algorithm.

(3) A parallel k-means algorithm based on the average and H weights is presented for a cluster environment. First, the algorithm partitions the data set horizontally across the nodes; each node sums its own data subset and uploads the result to the service node. The service node calculates the initial cluster centers according to the mean method and transfers them to each node. Second, each node classifies its data subset using the H-weighted iteration formula of the k-means clustering algorithm and uploads its results to the service node, which gathers and categorizes them; the new cluster centers are then computed and transferred to each node. This process repeats until the preset maximum number of iterations is exceeded or the objective function value is smaller than a certain threshold. Finally, experimental results on massive celestial spectrum data show that the parallel algorithm achieves good speedup, scalability, and data expansion ratio on the Hadoop cloud computing platform.
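
The node and service-node exchange described in (3) can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: the abstract does not define the H weights or the mean-based initialization, so this sketch assumes the plain squared Euclidean distance and takes the initial centers as an argument. The names `sq_dist`, `partial_sums`, `merge`, and `parallel_kmeans` are hypothetical, and the worker nodes are simulated by looping over a list of partitions rather than by running Hadoop jobs.

```python
def sq_dist(p, q):
    """Plain squared Euclidean distance (assumption: no H weights applied)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def partial_sums(subset, centers):
    """Work done on one node: assign each local point to its nearest
    center and return per-cluster coordinate sums, per-cluster counts,
    and the local contribution to the objective function."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    objective = 0.0
    for point in subset:
        j = min(range(k), key=lambda c: sq_dist(point, centers[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
        objective += sq_dist(point, centers[j])
    return sums, counts, objective


def merge(partials, old_centers):
    """Work done on the service node: combine the uploaded partial
    results into new cluster centers and a global objective value."""
    k, dim = len(old_centers), len(old_centers[0])
    new_centers = []
    for j in range(k):
        n = sum(counts[j] for _, counts, _ in partials)
        if n == 0:
            # Empty cluster: keep the previous center unchanged.
            new_centers.append(list(old_centers[j]))
        else:
            new_centers.append(
                [sum(sums[j][d] for sums, _, _ in partials) / n
                 for d in range(dim)]
            )
    objective = sum(obj for _, _, obj in partials)
    return new_centers, objective


def parallel_kmeans(partitions, centers, max_iter=20, threshold=0.0):
    """Repeat the assign-and-merge round until max_iter rounds have run
    or the objective value falls below the threshold."""
    objective = float("inf")
    for _ in range(max_iter):
        # On a real cluster these calls would run concurrently, one per node.
        partials = [partial_sums(part, centers) for part in partitions]
        centers, objective = merge(partials, centers)
        if objective < threshold:
            break
    return centers, objective
```

Only per-cluster sums and counts travel from the nodes to the service node, so the communication cost per round depends on the number of clusters and dimensions, not on the size of the data set. Calling `parallel_kmeans([[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]], [(1.0, 1.0), (9.0, 1.0)])` simulates two nodes holding two points each.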
Keywords/Search Tags: Fuzzy c-means clustering algorithm, fuzzy entropy, membership, regulatory factor, K-means clustering algorithm, H weights, Average, Euclidean distance, clustering center, Parallelization