Research On Clustering Algorithm On Hadoop Platform

Posted on: 2017-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: X Wan
Full Text: PDF
GTID: 2348330488957690
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of the Internet, the amount of data is growing quickly, and traditional clustering algorithms struggle to meet performance requirements when applied to big data. Cloud computing platforms, which evolved from parallel computing, emerged in response. Hadoop is currently the most widely used cloud computing platform. Cloud computing applications, being distributed and heterogeneous among other features, are well suited to large-scale data processing. MapReduce is the core model of the Hadoop platform and is the model most widely used to improve the efficiency of clustering algorithms. As data volumes grow, processing large data sets on cloud computing platforms has become an active area, and research on data mining algorithms for these platforms is gradually becoming a hot topic. Existing work focuses mainly on parallelizing traditional clustering algorithms or on designing clustering algorithms for cloud platforms, and performance is evaluated mainly by speedup.

This paper makes the following contributions.

First, after analyzing the Canopy-kmeans algorithm and its shortcomings, the paper proposes an improved algorithm. The algorithm reduces running time through grouping and sampling, and it improves clustering effectiveness by modifying the Canopy algorithm according to the max-min principle. MATLAB simulations show that both the running time and the effectiveness of the algorithm are improved. The paper then implements the improved Canopy-kmeans algorithm on the Hadoop platform; the results show that the speedup increases linearly as the number of nodes increases linearly, so the algorithm handles large data efficiently.

Second, CFSFDP is a recent density-based clustering algorithm. After analyzing the algorithm and its shortcomings, the paper proposes an improved algorithm, R-CFSFDP, which proceeds in three steps: it selects part of the data, finds the M-value of that part, and clusters the data based on the M-value.
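For orientation, the baseline that R-CFSFDP modifies can be sketched as follows. This is a minimal sketch, assuming CFSFDP refers to the standard density-peaks procedure: compute a local density for each point, compute each point's distance to the nearest denser point, pick centers where both are large, and let every other point follow its nearest denser neighbor. The function name `density_peaks`, the cutoff `d_c`, and the choice of the top `k` points by density times distance are illustrative assumptions. The abstract does not define the M-value, so this sketch does not cover it or the sampling step of R-CFSFDP.

```python
import math

def density_peaks(points, d_c, k):
    """Baseline density-peaks clustering (hypothetical helper, not the thesis code).

    points: list of coordinate tuples; d_c: cutoff distance (assumed parameter);
    k: number of clusters, with k >= 1 (assumed parameter).
    Returns one cluster label per point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # delta_i: distance to the nearest point that comes earlier in `order`
    # (a denser point); `parent` records which point that was.
    delta = [0.0] * n
    parent = [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(dist[i])  # densest point: use its largest distance
            continue
        j = min(order[:pos], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j

    # Centers: the k points with the largest rho * delta. Sorting `order`
    # (a stable sort) keeps the densest point first among ties.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:k]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c

    # Each remaining point inherits the label of its nearest denser neighbor,
    # which has already been labeled because it comes earlier in `order`.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

# Two well-separated groups of four points each (made-up toy data).
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(density_peaks(toy, d_c=1.5, k=2))
```

The pairwise distance table makes this quadratic in the number of points, which is one plausible reason to cluster only a sample or to split the data into groups before clustering.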
MATLAB simulations show that R-CFSFDP greatly reduces running time. However, they also show that the effectiveness of the improved algorithm declines, and R-CFSFDP cannot be combined effectively with a cloud platform. To address these shortcomings, the paper proposes a further improved algorithm, n-CFSFDP, which proceeds in three steps: it divides the data into groups, clusters the grouped data with the CFSFDP algorithm, and combines the clustering results based on the M-value. Unlike R-CFSFDP, n-CFSFDP processes all of the data, so it achieves better effectiveness. MATLAB simulations show that n-CFSFDP greatly reduces running time while achieving better effectiveness. The paper implements n-CFSFDP on the Hadoop platform; the results show that the speedup increases rapidly as the number of nodes increases, so the algorithm can handle large data efficiently.
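The abstract evaluates both Hadoop implementations by speedup without defining it. A minimal sketch, assuming the common definition S(n) = T(1) / T(n), where T(n) is the running time on n nodes, is given below. The function names and the timings are hypothetical and are not results from the thesis.

```python
def speedup(t_single, t_parallel):
    """Speedup S(n) = T(1) / T(n), assuming this common definition."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("running times must be positive")
    return t_single / t_parallel

def efficiency(t_single, t_parallel, nodes):
    """Speedup per node; 1.0 corresponds to ideal linear speedup."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(t_single, t_parallel) / nodes

# Hypothetical timings in seconds, for illustration only.
t1, t4 = 120.0, 30.0
print(speedup(t1, t4))        # 4.0
print(efficiency(t1, t4, 4))  # 1.0
```

Under this definition, a speedup that increases linearly with the number of nodes corresponds to a roughly constant efficiency.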
Keywords/Search Tags: cluster algorithms, cloud computing platform, Canopy, k-means, CFSFDP