
Research On Distributed Clustering Algorithm Based On MapReduce

Posted on: 2018-07-18; Degree: Master; Type: Thesis
Country: China; Candidate: G B Li; Full Text: PDF
GTID: 2358330518961611; Subject: Computer technology
Abstract/Summary:
Cluster analysis is one of the most basic data analysis techniques in data mining and has been widely used in economics, the social sciences and computer science. However, with the rapid development of Internet technology, the volume of data generated by network applications has increased dramatically, which poses great technical challenges to traditional clustering methods. How to obtain valuable information from massive data quickly and effectively has become an urgent problem that many industries need to solve. The maturity of cloud computing technology makes it possible to process massive amounts of data quickly and efficiently. Hadoop is an open-source distributed cloud computing platform whose core components are the Hadoop Distributed File System (HDFS) and MapReduce: HDFS provides massive data storage, and the MapReduce programming model provides parallel data processing. Compared with traditional parallel programming models, MapReduce encapsulates the details of data partitioning, task scheduling and parallel processing, so users can develop distributed applications without understanding the underlying distributed details, which greatly simplifies the design of parallel programs.

As the most classic clustering algorithm, K-means has been applied in many industries. However, as the data size grows, the number of iterations of the algorithm increases noticeably, which reduces its efficiency. To make K-means better suited to clustering large-scale data, this paper first implements a parallel version of the algorithm on the Hadoop platform following the MapReduce programming model, and then proposes improvements that address two weaknesses: the blindness of randomly selecting the initial cluster centers, and the tendency of the clustering results to fall into a local optimum. The main work of the paper is as follows:

(1) Based on an analysis of the traditional K-means algorithm and the idea of maximum and minimum distance, a parallel K-means algorithm based on maximum and minimum distance is proposed. The initial center points of K-means are selected according to the maximum and minimum distance idea, which avoids choosing initial centers that lie too close to each other and thus improves the quality of the clustering results. To improve efficiency, a parallel version of the algorithm is designed and implemented.

(2) After analyzing the principle of the One-pass Cluster algorithm and its advantages and disadvantages, and combining it with the characteristics of the traditional K-means algorithm, the parallel OPKMEANS algorithm is proposed. Because it is simple and efficient, the One-pass Cluster algorithm is first used to cluster the data quickly and “roughly”; the resulting cluster centers are then used as the initial center points of K-means. This avoids the blindness of randomly selecting cluster centers and reduces the number of K-means iterations, which keeps down the data transfer overhead of the parallel process and so improves the efficiency of the algorithm.

(3) To verify the effectiveness of the improved algorithms, a Hadoop distributed computing platform was built on virtual machines, based on a study of Hadoop's principles, and multiple experiments were carried out. The superiority of the algorithms was verified in terms of clustering quality, speedup and scalability.
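The abstract does not give the thesis's exact formulations, so the following is only a minimal single-process sketch of the three ideas it names, written in Python rather than as Hadoop jobs. It assumes the common farthest-point reading of maximum and minimum distance seeding (the first point as the first center is an arbitrary choice), a distance threshold for the one-pass step, and a K-means iteration split into map and reduce functions. All function names and the `threshold` parameter are hypothetical and do not come from the thesis.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k):
    """Seed k centers: repeatedly take the point whose distance to its
    nearest already-chosen center is largest (assumed variant)."""
    centers = [points[0]]  # assumption: start from the first point
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def one_pass_centers(points, threshold):
    """Rough one-pass clustering: a point joins the nearest cluster whose
    centroid is within `threshold`, otherwise it starts a new cluster."""
    sums, counts = [], []
    for p in points:
        best, best_d = None, threshold
        for i, (s, n) in enumerate(zip(sums, counts)):
            d = dist(p, [x / n for x in s])
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            sums.append(list(p))
            counts.append(1)
        else:
            sums[best] = [a + b for a, b in zip(sums[best], p)]
            counts[best] += 1
    return [tuple(x / n for x in s) for s, n in zip(sums, counts)]


def kmeans_map(point, centers):
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point


def kmeans_reduce(assigned):
    """Reduce step: average all points that share one center index."""
    n = len(assigned)
    return tuple(sum(coords) / n for coords in zip(*assigned))


def kmeans_iteration(points, centers):
    """One K-means round expressed as map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for p in points:                      # map phase
        key, value = kmeans_map(p, centers)
        groups[key].append(value)         # shuffle: group by key
    new_centers = list(centers)           # empty clusters keep old center
    for key, assigned in groups.items():  # reduce phase
        new_centers[key] = kmeans_reduce(assigned)
    return new_centers


def kmeans(points, centers, max_iter=20):
    """Iterate until the centers stop changing or max_iter is reached."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(data, max_min_centers(data, 2)))
    print(kmeans(data, one_pass_centers(data, threshold=3.0)))
```

In a real MapReduce job each iteration would be a separate job that reads the current centers from shared storage, which is why seeding that lowers the iteration count also lowers the data transfer overhead the abstract mentions.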
Keywords/Search Tags: Clustering algorithm, MapReduce, K-means, Canopy, parallel algorithm