Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform

Posted on:2018-11-02

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu

Full Text:PDF

GTID:2348330512466989

Subject:Electronic and communication engineering

Abstract/Summary:

As one of the most popular research directions in data mining, cluster analysis has long been favored by researchers and developers. Clustering divides the original data objects into several clusters, with the goal that the similarity between data objects in the same cluster is high and the similarity between data objects in different clusters is low. With the development of the mobile Internet, networking, and artificial intelligence, the amount of information generated on the Web keeps growing, and how to cluster large-scale data efficiently and stably has become a new research topic. With the rise of the Hadoop distributed cloud computing platform, it has become possible to address the performance problems of traditional serial algorithms by running the computation in parallel on multiple computing nodes. This thesis studies the Hadoop distributed cloud computing platform, clustering algorithms, and related technologies in depth, and designs and implements a cluster analysis system based on the Hadoop platform. The system is divided into three layers: the underlying driver layer, the middle logic layer, and the external service layer. The thesis describes the design and implementation of the system in detail. The aim is to encapsulate the concrete clustering operations internally and expose a simple interface externally, so that the specific algorithm is transparent to the user and cluster analysis runs stably and efficiently. Based on an in-depth analysis of the problems of the K-Means algorithm, the thesis proposes an improved scheme for the Hadoop distributed platform. With the experimental environment configured through the proposed cluster analysis system, the algorithm is optimized in three respects: parallel random sampling, parallelization of the sample distance computation, and parallelization of the data clustering process. The flow of the improved parallel K-Means algorithm is also described in detail. Finally, the improved parallel K-Means algorithm is tested in a cluster environment on four measures: convergence rate, accuracy, initialization sampling rate, and speedup ratio. The experimental results show that the cluster analysis system based on the Hadoop distributed cloud computing platform provides an efficient, stable, and configurable cluster analysis service, and that the improved parallel K-Means algorithm can quickly handle large-scale cluster analysis computations.
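
The abstract does not give the details of the improved scheme, so the following is only a minimal sketch of the standard MapReduce formulation of K-Means, which is the kind of parallelization the abstract refers to for the distance computation and clustering steps. It is written in plain Python rather than against the Hadoop API, and every name in it (`map_assign`, `reduce_update`, `kmeans`, the `splits` input) is hypothetical and not taken from the thesis. Each input split stands in for the data handled by one map task.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(points: Iterable[Point],
               centroids: List[Point]) -> Iterator[Tuple[int, Tuple[Point, int]]]:
    """Map step (hypothetical): emit (nearest centroid index, (point, 1))."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        yield nearest, (p, 1)


def reduce_update(pairs: Iterable[Tuple[int, Tuple[Point, int]]],
                  centroids: List[Point]) -> List[Point]:
    """Reduce step (hypothetical): average the points assigned to each centroid."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for cid, (p, n) in pairs:
        if cid not in sums:
            sums[cid] = list(p)
        else:
            sums[cid] = [s + x for s, x in zip(sums[cid], p)]
        counts[cid] += n
    updated = list(centroids)  # a centroid with no assigned points is left unchanged
    for cid, total in sums.items():
        updated[cid] = tuple(s / counts[cid] for s in total)
    return updated


def kmeans(splits: List[List[Point]], centroids: List[Point],
           max_iter: int = 20, tol: float = 1e-9) -> List[Point]:
    """Run map and reduce rounds until the centroids stop moving."""
    for _ in range(max_iter):
        # On a real cluster each split would be processed by a separate map task.
        pairs = [kv for split in splits for kv in map_assign(split, centroids)]
        updated = reduce_update(pairs, centroids)
        shift = max(squared_distance(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

In this sketch the distance computation is independent per point, which is why it fits the map phase, while the centroid update only needs per-cluster sums and counts, which is why it fits the reduce phase. The parallel random sampling used for initialization in the thesis is not shown here; the initial centroids are simply passed in by the caller.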

Keywords/Search Tags:

Hadoop, cloud computing, K-Means, clustering analysis

Related items

1	Cloud Computing-based Integrated Operation Management Platform Research
2	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
3	Research On Parallelization Of Text Clustering Based On Hadoop Cloud Computing Platform
4	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
5	Research Of Clustering Algorithm Based On Cloud Computing Platform
6	The Key Research Of Clustering Algorithm Parallelization On The Platform Of Cloud Computing
7	Clustering Algorithm Based On The Background Of Big Data
8	Research On Data Mining Technology Of Internet Of Things Based On Cloud Computing
9	Research On Machine Learning Clustering Algorithms In The Hadoop Development Environment
10	K-Means Algorithm Design And Implementation Based On Hadoop And Mahout