
Parallelized Clustering Algorithm Based On The Cloud Platform

Posted on: 2016-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: K Cheng
GTID: 2308330473465512
Subject: Computer system architecture
Abstract/Summary:
Clustering is an important part of data mining: it extracts useful information and knowledge from data, and it has been widely applied in industrial, commercial, and research fields. As the volume of data generated by society has grown dramatically, the computing power available to clustering algorithms running on a single machine can no longer meet demand. To succeed commercially amid intense competition, most Internet companies have sought effective strategies for handling large-scale data, and distributed clustering algorithms, in which multiple computers take part in the computation, have become a focus of current research.

Cloud computing is a novel commercial computing model. It provides superior computing power by integrating the resources of Internet nodes through virtualization, and it can dynamically add nodes to the computing cluster as the task load changes in real time. The system allocates the tasks to be processed across the nodes of the cluster in a reasonable way. Users can provision the cloud platform's infrastructure according to their actual needs for storage space, computing power, and other resources, without understanding its internal details. Hadoop is an open-source cloud computing platform developed by Apache: a software framework for the distributed processing of massive data that handles data in an efficient, reliable, and scalable way. It also offers high fault tolerance and low cost. At its core, Hadoop consists of HDFS (the Hadoop Distributed File System) at the bottom layer and MapReduce (a programming model) above it, which provide storage and computation for massive data, respectively.

This thesis mainly studies how to use the parallel computing power of the many computer nodes in a cloud platform to solve the problem of clustering large-scale data.
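As a rough illustration of how a clustering step maps onto the MapReduce model described above, the sketch below simulates one Kmeans iteration in plain Python. It does not use the Hadoop API and is not the thesis's implementation; the function names (`map_phase`, `reduce_phase`, `kmeans_iteration`) are hypothetical and chosen only for this example. The map step emits each point keyed by its nearest center, and the reduce step averages the points that share a key.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points: Iterable[Point],
              centers: Sequence[Point]) -> Iterator[Tuple[int, Point]]:
    """Map step: emit (index of nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(p, centers[i]))
        yield nearest, p


def reduce_phase(pairs: Iterable[Tuple[int, Point]],
                 centers: Sequence[Point]) -> List[Point]:
    """Reduce step: replace each center by the mean of its assigned points.

    A center that received no points keeps its previous position.
    """
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)
    for idx, members in groups.items():
        dim = len(members[0])
        new_centers[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centers


def kmeans_iteration(points: Sequence[Point],
                     centers: Sequence[Point]) -> List[Point]:
    """One full iteration: map every point, then reduce per center."""
    return reduce_phase(map_phase(points, centers), centers)
```

On a real cluster the map calls would run on different nodes over different splits of the input, and the framework would group the emitted pairs by key before the reduce step; here both happen in a single process.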
Several improvements addressing the shortcomings of the Kmeans algorithm are put forward: the Canopy algorithm is used as the initial step of Kmeans clustering, with the selection of initial cluster centers optimized according to the "min-max principle"; and the Kmeans iterative process is optimized to reduce the overall amount of computation and further improve the efficiency of the algorithm. After a detailed analysis of the problems in DBSCAN, including parameter selection, memory usage, and I/O overhead, we propose a hierarchy-based optimized algorithm. It not only eliminates the influence of improper parameter selection on the algorithm's efficiency, but also reduces, to some extent, the number of queries and thus the I/O overhead.

Finally, we run a series of tests on the established MapReduce platform to verify the performance of the original Kmeans and DBSCAN algorithms and of their optimized versions. The results suggest that both the speed and the accuracy of the optimized Kmeans algorithm are improved, that the accuracy and effectiveness of the optimized DBSCAN algorithm are increased, and that, as the speedup-ratio test shows, the parallel algorithms are better suited to processing large data.
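The "min-max principle" for choosing initial cluster centers can be illustrated with a short sketch. The abstract does not spell out the exact procedure, so the code below assumes the common farthest-first reading: starting from one seed, repeatedly pick the point whose minimum distance to the centers chosen so far is largest. The function name `max_min_centers` and the choice of the first point as seed are assumptions made for this example, and the Canopy step is not modeled.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_min_centers(points: Sequence[Point], k: int) -> List[Point]:
    """Pick k well-separated initial centers (assumed farthest-first rule).

    The first point is used as the seed, which is an arbitrary choice
    made for this sketch.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centers = [points[0]]
    # min_dist[i] holds the distance from points[i] to its nearest center.
    min_dist = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        # Adding a center can only shrink each point's nearest distance.
        min_dist = [
            min(d, squared_distance(p, points[idx]))
            for d, p in zip(min_dist, points)
        ]
    return centers
```

Spreading the initial centers out in this way is intended to avoid the poor local optima that randomly chosen, closely spaced centers can produce.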
Keywords/Search Tags: clustering, cloud computing, Hadoop, Kmeans, DBSCAN