Research Of Clustering Algorithm Based On Cloud Computing Platform

Posted on:2015-02-11

Degree:Master

Type:Thesis

Country:China

Candidate:M Yao

Full Text:PDF

GTID:2298330452950753

Subject:Computer application technology

Abstract/Summary:

Clustering has long been one of the most important branches of data mining. Without prior knowledge, a clustering algorithm can help researchers discover the regular patterns and organizational structure of the objects in a dataset. As technology develops, the amount of data contained in datasets grows exponentially, and the traditional model of cluster analysis is no longer sufficient for current data sizes. Recently introduced distributed platforms such as Hadoop and Spark provide a new direction for the development and research of cluster analysis, and clustering algorithms have become a research focus.

To address the problem that traditional clustering algorithms cannot cluster big data efficiently, this thesis studies and optimizes clustering algorithms and then introduces a cloud computing scheme. The main work of this thesis is as follows:

(1) First, this thesis analyzes in depth the K-means algorithm, a classic partition-based algorithm. The features and implementation of K-means are introduced, and several of its shortcomings are elaborated. Based on these shortcomings, a solution is proposed that preprocesses the dataset to derive the initial k value and the initial cluster centers for K-means, so the algorithm is improved from the perspective of optimizing its initial values. Because partition-based algorithms are sensitive to the shape of the dataset, this thesis also analyzes and improves a density-based algorithm, DBSCAN. The improved DBSCAN algorithm reduces time consumption to some extent.

(2) To solve the problem that the traditional model of clustering algorithms has difficulty handling large datasets, this thesis studies the MapReduce programming model and designs parallel versions of the improved algorithms in the MapReduce framework on Hadoop.

(3) Through comparative experiments that examine the characteristics of the two algorithms when processing datasets of arbitrary shape, this thesis demonstrates that K-means with optimized initial values is better than the original K-means in both clustering results and algorithm complexity, and that the improved DBSCAN algorithm reduces time consumption. It also demonstrates that the two parallelized algorithms fully reflect the advantages of distributed computing, greatly reducing computation time and greatly improving data-processing efficiency.
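
The abstract does not describe the thesis's actual parallel design, so the following is only a minimal single-process sketch of how one K-means iteration is commonly expressed in map/reduce style. The function names (map_assign, reduce_mean, kmeans_iteration) are hypothetical, and the dictionary grouping merely stands in for the shuffle phase that a framework such as Hadoop would perform.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point: Point, centers: Sequence[Point]) -> Tuple[int, Tuple[Point, int]]:
    """Map step (hypothetical name): emit (nearest center index, (point, 1))."""
    nearest = min(range(len(centers)),
                  key=lambda i: squared_distance(point, centers[i]))
    return nearest, (point, 1)


def reduce_mean(values: Iterable[Tuple[Point, int]]) -> Point:
    """Reduce step (hypothetical name): average the points sent to one center."""
    total: List[float] = []
    count = 0
    for point, n in values:
        if not total:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points: Iterable[Point], centers: Sequence[Point]) -> List[Point]:
    """Run one map -> shuffle -> reduce pass and return the updated centers."""
    groups: Dict[int, List[Tuple[Point, int]]] = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centers)
        groups[key].append(value)  # stands in for the framework's shuffle phase
    new_centers = list(centers)    # a center with no assigned points is kept as-is
    for key, values in groups.items():
        new_centers[key] = reduce_mean(values)
    return new_centers
```

In a real job the map calls would run on separate data splits and the iteration would repeat until the centers stop moving. The initial centers passed in are where an initial-value optimization such as the one the abstract mentions would plug in.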

Keywords/Search Tags:

Clustering algorithm, Hadoop, MapReduce, K-means, DBSCAN

Related items

1	Research And Implementation Of Clustering Algorithm Based On Hadoop Platform
2	Application And Research Of DBSCAN Based On Hadoop Platform
3	Parallel Clustering Algorithm Based On MapReduce
4	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
5	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
6	Research On Parallel Clustering Algorithm On Hadoop Platform
7	The Research Of Data Optimization And Application Of Clustering Algorithm Based On Hadoop
8	The Clustering Algorithm Based On Hadoop Parallel Analysis And Applied Research
9	Research On Parallel Clustering Algorithm For Large-Scale Data Set
10	Analysis And Research On Parallel Clustering Algorithm Based On Hadoop