
Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform

Posted on: 2018-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2348330512466989
Subject: Electronic and communication engineering
Abstract/Summary:
Cluster analysis is one of the most active research directions in data mining and has long been favored by researchers and developers. Clustering divides a set of data objects into several clusters, with the goal that objects in the same cluster are highly similar to one another while objects in different clusters are not. With the development of the mobile Internet, networking and artificial intelligence, the amount of information generated on the Web keeps growing, and how to cluster large-scale data efficiently and stably has become a new research topic. The rise of the Hadoop distributed cloud computing platform makes it possible to address the performance problems of traditional serial algorithms by computing in parallel across multiple nodes.

This thesis studies the Hadoop distributed cloud computing platform, clustering algorithms and related technologies in depth, and designs and implements a cluster analysis system based on the Hadoop platform. The system is divided into three layers: the underlying driver layer, the middle logic layer and the external service layer. The design and implementation of the system are described in detail. The aim is to encapsulate the specific operations of cluster analysis internally and expose a simple operating interface externally, so that the underlying algorithm is transparent to the user and cluster analysis runs stably and efficiently.

Based on an in-depth analysis of the problems of the K-Means algorithm, this thesis proposes an improved scheme for the Hadoop distributed platform. With the experimental environment configured through the proposed cluster analysis system, the algorithm is optimized in three respects: parallel random sampling, parallelization of the sample distance computation and parallelization of the data clustering process. The flow of the improved parallel K-Means algorithm is also described in detail.

Finally, the improved parallel K-Means algorithm is tested in four respects: convergence rate, accuracy, initialization sampling rate and speedup ratio in a cluster environment. The experimental results show that the cluster analysis system based on the Hadoop distributed cloud computing platform provides an efficient, stable and configurable cluster analysis service, and that the improved parallel K-Means algorithm can quickly handle large-scale cluster analysis computations.
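As an illustration of how the distance computation and the clustering step can be expressed as map and reduce phases, the following is a minimal single-process sketch of the standard MapReduce formulation of K-Means. It is not the improved algorithm of the thesis, whose details are not given in this abstract, and it does not use Hadoop itself: the shuffle phase is simulated with a dictionary. The function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) and the parameters `max_iter` and `tol` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of nearest centroid, (vector_sum, count))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: merge (vector_sum, count) pairs and return their mean."""
    total = None
    count = 0
    for vector_sum, n in values:
        if total is None:
            total = list(vector_sum)
        else:
            total = [a + b for a, b in zip(total, vector_sum)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """One K-Means iteration; the shuffle phase is simulated by a dictionary."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat iterations until the largest centroid shift falls below tol."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        shift = max(dist(a, b) for a, b in zip(updated, centroids))
        centroids = updated
        if shift < tol:
            break
    return centroids
```

Because each map call depends only on one point and the current centroids, the map phase can be distributed across nodes; emitting `(vector_sum, count)` pairs rather than bare points also allows partial sums to be merged before the reduce phase.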
Keywords/Search Tags:Hadoop, cloud computing, K-Means, clustering analysis