Parallelized Clustering Algorithm Based On The Cloud Platform

Posted on:2016-08-03

Degree:Master

Type:Thesis

Country:China

Candidate:K Cheng

GTID:2308330473465512

Subject:Computer system architecture

Abstract/Summary:

Clustering is an important part of data mining: it extracts useful information and knowledge from data, and it is widely applied in industry, commerce, and research. As the volume of data generated by society grows dramatically, the computing power available to a clustering algorithm on a single machine can no longer meet demand. To succeed commercially amid intense competition, most Internet companies have sought effective strategies for handling large-scale data, and distributed clustering algorithms, in which multiple computers take part in the computation, have become a focus of current research.

Cloud computing is a novel commercial computing model. It provides strong computing power by integrating the resources of Internet nodes through virtualization, and it can dynamically expand the nodes in the cluster as the task load changes in real time. The system allocates the tasks to be processed across the nodes of the cluster in a reasonable way. Users can deploy on the cloud platform's infrastructure according to their actual needs for storage space, computing power, and other resources, without understanding its internal details. Hadoop is an open-source cloud computing platform developed by Apache: a software framework for distributed processing of massive data that handles data in an efficient, reliable, and scalable way. It also offers high fault tolerance and low cost. At its core, Hadoop consists of HDFS (the Hadoop Distributed File System) at the bottom layer and MapReduce (a programming model) above it, which provide storage and computation for massive data, respectively.

This thesis studies how to use the parallel computing power of the many computer nodes in a cloud platform to solve the problem of clustering large-scale data.
Several improvements are proposed to address the shortcomings of the Kmeans algorithm: the Canopy algorithm is used as a preliminary step for Kmeans clustering, with the selection of initial cluster centers optimized according to the "min-max principle"; and the Kmeans iteration is optimized to reduce the overall amount of computation and further improve the algorithm's efficiency. After a detailed analysis of the problems of DBSCAN, including parameter selection, memory usage, and I/O overhead, a hierarchy-based optimized algorithm is proposed. It not only eliminates the effect of improper parameter selection on the algorithm's efficiency, but also reduces, to some extent, the number of queries and hence the I/O overhead.

Finally, a series of tests is run on the established MapReduce platform to verify the performance of Kmeans, DBSCAN, and their optimized versions. The results suggest that the optimized Kmeans algorithm improves in both speed and accuracy, that the optimized DBSCAN algorithm improves in accuracy and effectiveness, and that, as the speedup ratio test shows, the parallel algorithms are better suited to processing large data sets.
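
As a rough illustration of the Kmeans ideas summarized above, the following Python sketch may help. It is not the thesis's implementation. It assumes the "min-max principle" can be approximated by a farthest-first rule, in which each new initial center is the point whose distance to its nearest already-chosen center is largest, and it expresses one Kmeans iteration as a map step and a reduce step run in a single process rather than on Hadoop. All function names (`max_min_centers`, `kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) are hypothetical.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k):
    """Choose k initial centers with a farthest-first rule (an assumed
    stand-in for the "min-max principle"): start from the first point,
    then repeatedly add the point that is farthest from its nearest
    already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def kmeans_map(point, centers):
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point


def kmeans_reduce(members):
    """Reduce step: the new center is the coordinate-wise mean."""
    n = len(members)
    return tuple(sum(coord) / n for coord in zip(*members))


def kmeans_iteration(points, centers):
    """Run one map/shuffle/reduce round in a single process."""
    groups = defaultdict(list)
    for p in points:
        idx, pt = kmeans_map(p, centers)
        groups[idx].append(pt)  # shuffle: group points by center index
    # Keep the old center when no point was assigned to it.
    return [
        kmeans_reduce(groups[i]) if groups[i] else centers[i]
        for i in range(len(centers))
    ]
```

In a real MapReduce job the map step would run in parallel over data splits stored in HDFS and the reduce step would run once per center index; here both are ordinary function calls so that the data flow is easy to follow. Calling `kmeans_iteration` repeatedly until the centers stop changing gives the usual Kmeans loop.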

Keywords/Search Tags:

clustering, cloud computing, Hadoop, Kmeans, DBSCAN

Related items

1	Research Of Clustering Algorithm Based On Cloud Computing Platform
2	Analysis And Research On Parallel Clustering Algorithm Based On Hadoop
3	The Bad Data Identification Of Power System Based On Cloud Computing And Improved KMeans
4	Reach On Map-Reduce Application Based On Hadoop
5	The Research Of Parallel Clustering Algorithm Based On Hadoop Platform
6	Research On K-Means Clustering Algorithm Based On Hadoop Cloud Computing Platform
7	Application Of Improved Clustering Algorithm Based On Hadoop In Web Log Clustering
8	Research On Safe Outsourcing Algorithms For Hermite Normal Form And DBSCAN Clustering
9	Kmeans Analysis Of Massive Book Circulation Data Based On Hadoop