Research On Clustering Algorithm On Hadoop Platform

Posted on:2017-11-14

Degree:Master

Type:Thesis

Country:China

Candidate:X Wan

Full Text:PDF

GTID:2348330488957690

Subject:Communication and Information System

Abstract/Summary:

With the rapid development of the Internet, the amount of data is growing quickly, and traditional clustering algorithms struggle to meet performance requirements when applied to big data. Cloud computing platforms, which evolved from parallel computing, emerged in response to this need. Hadoop is currently the most widely used cloud computing platform, and its distributed, heterogeneous design makes it well suited to large-scale data processing. MapReduce is the core model of the Hadoop platform and the model most widely used to improve the efficiency of clustering algorithms. As data volumes grow, processing large data sets on cloud computing platforms has become a research hot spot, and so has research on data mining algorithms for these platforms. Current work focuses mainly on parallelizing traditional clustering algorithms or on designing clustering algorithms for cloud platforms, and performance is evaluated mainly by speedup.

This thesis completes the following work.

First, after analyzing the Canopy-kmeans algorithm and its shortcomings, the thesis proposes an improved algorithm. Grouping and sampling reduce the running time, and improving the Canopy algorithm through the maximum-minimum (max-min) principle raises its effectiveness. MATLAB simulation results show that both the running time and the effectiveness of the algorithm improve. The improved Canopy-kmeans algorithm is then implemented on the Hadoop platform. The results show that the speedup increases linearly as the number of nodes increases linearly, so the algorithm handles large data efficiently.

Second, CFSFDP is a new density-based clustering algorithm. After analyzing the algorithm and its shortcomings, the thesis proposes an improved algorithm, R-CFSFDP. The algorithm first selects part of the data, then finds the M-value of that part, and finally clusters the data based on the M-value. MATLAB simulation results show that R-CFSFDP greatly reduces the running time, but also that its effectiveness declines and that it cannot be combined effectively with a cloud platform.

To address these shortcomings, the thesis proposes a further improved algorithm, n-CFSFDP. The algorithm first divides the data into groups, then clusters the data with the CFSFDP algorithm, and finally combines the clustering results based on the M-value. Unlike R-CFSFDP, n-CFSFDP processes all of the data, so it is more effective. MATLAB simulation results show that n-CFSFDP greatly reduces the running time and has better effectiveness. The n-CFSFDP algorithm is implemented on the Hadoop platform, and the results show that the speedup increases rapidly as the number of nodes increases, so the algorithm can handle large data efficiently.
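For readers unfamiliar with CFSFDP (clustering by fast search and find of density peaks), the baseline algorithm that R-CFSFDP and n-CFSFDP start from can be sketched roughly as below. This is a minimal single-machine Python sketch, not the thesis's MATLAB or Hadoop code. The function name `cfsfdp`, the cutoff distance `d_c`, and the `n_clusters` parameter are assumptions made for illustration. The M-value used by the improved variants is not defined in the abstract, so it is not modeled here.

```python
import math


def cfsfdp(points, d_c, n_clusters):
    """Baseline density-peaks clustering for a small in-memory list of points.

    Assumes n_clusters >= 1 and at least n_clusters points.
    Returns one integer label per point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta_i: distance to the nearest point that comes earlier in density order.
    delta = [0.0] * n
    nearest_denser = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_denser[i] = j
    # By convention the densest point gets the largest distance in the data set.
    delta[order[0]] = max(max(row) for row in dist)

    # Cluster centers: the points with the largest rho * delta.
    # Sorting `order` with a stable sort keeps the densest point first on ties.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c

    # Every remaining point inherits the label of its nearest denser neighbor,
    # which has already been labeled because we walk in density order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels
```

The sketch builds the full pairwise distance matrix, so its cost grows quadratically with the number of points. That cost is one plausible reason to work on a sample or on groups of the data, as the abstract describes for R-CFSFDP and n-CFSFDP.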

Keywords/Search Tags:

cluster algorithms, cloud computing platform, Canopy, k-means, CFSFDP

Related items

1	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
2	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
3	Cloud Computing-based Integrated Operation Management Platform Research
4	The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform
5	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
6	Research And Application Of Virtual Machine Anomaly Detection Technology In Cloud Platform
7	Design And Implementation Of Cloud Monitoring System Based On Cluster Server
8	The Key Research Of Clustering Algorithm Parallelization On The Platform Of Cloud Computing
9	Research On The Construction Of Cloud Platform Based On Docker And The Optimization Technology Of Cluster Management
10	Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform