Research Of Clustering Algorithm Based On Cloud Computing Platform

Posted on: 2015-02-11 | Degree: Master | Type: Thesis
Country: China | Candidate: M Yao
GTID: 2298330452950753 | Subject: Computer application technology
Abstract/Summary:
Clustering has always been one of the most important branches of data mining. Without prior knowledge, a clustering algorithm can help researchers discover the regular patterns and organizational structure of the objects in a dataset. With the development of technology, the amount of data contained in datasets is growing exponentially, and the traditional model of cluster analysis is no longer sufficient for current data sizes. Newly presented distributed platforms such as Hadoop and Spark provide a new direction for the development and research of cluster analysis. Meanwhile, clustering algorithms have become a research focus.

To deal with the problem that traditional clustering algorithms cannot handle big data clustering efficiently, this thesis studies and optimizes clustering algorithms and then introduces a cloud computing scheme. The main work of this thesis is as follows:

(1) First, this thesis makes an in-depth analysis of the K-means algorithm, a classic partition-based algorithm. The features and implementation of K-means are introduced, and several of its shortcomings are elaborated. Based on these shortcomings, the thesis proposes preprocessing the dataset to derive the initial k value and the initial cluster centers of K-means, so the algorithm is improved from the perspective of optimizing the initial values. Because partition-based algorithms are sensitive to the shape of the dataset, this thesis also analyzes and improves a density-based algorithm, DBSCAN. The improved DBSCAN algorithm reduces time consumption to some extent.

(2) To solve the problem that the traditional model of clustering algorithms has difficulty handling large data sets, this thesis studies the MapReduce programming model and gives a parallel design of the improved algorithms in the MapReduce framework on Hadoop.

(3) Through comparative experiments that examine the characteristics of the two algorithms when processing data sets of arbitrary shape, this thesis demonstrates that K-means with optimized initial values is better than the original K-means algorithm in terms of clustering results and algorithm complexity, and that the improved DBSCAN algorithm reduces time consumption. It also demonstrates that the two parallelized algorithms fully reflect the advantages of distributed computing, greatly reducing computation time and improving data processing efficiency.
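
The abstract does not give the details of the thesis's parallel design, so the following is only a minimal sketch of how one K-means iteration is commonly expressed as a map step and a reduce step. It simulates the MapReduce round in a single Python process rather than on Hadoop, and it implements plain K-means, not the thesis's initial-value optimization. All function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) and parameters are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that were assigned to one center."""
    dims = len(values[0][0])
    total = [0.0] * dims
    count = 0
    for point, n in values:
        for d in range(dims):
            total[d] += point[d]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centers):
    """Simulate one MapReduce round in-process: map, group by key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # Assumption: a center that attracted no points keeps its old position.
    return [kmeans_reduce(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat rounds until no center moves by more than tol."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

In this sketch the initial centers are passed in by the caller, which is where a preprocessing step such as the one the thesis proposes would plug in. On a real cluster, each round would typically run as a separate MapReduce job, with the current centers distributed to every mapper.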
Keywords/Search Tags: Clustering algorithm, Hadoop, MapReduce, K-means, DBSCAN