Research On K-medoids Clustering Algorithm Based On Spark

Posted on: 2019-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Zang
Full Text: PDF
GTID: 2348330548462252
Subject: Computer technology
Abstract/Summary:
With the deep integration of the Internet and traditional industries, data is now growing explosively. In an age when data is king, people are increasingly aware of the importance of the information contained in massive data and of its guiding role in work and life. Rapidly discovering useful knowledge in such data and using it to guide life and production is therefore a valuable research topic. Traditional clustering algorithms give good results on relatively small data sets, but a traditional serial algorithm cannot complete the task quickly on massive data, and it runs more slowly as the data volume increases.

The main task of this thesis is to study the advantages and disadvantages of the traditional Canopy algorithm and K-medoids algorithm. The Canopy algorithm can quickly perform coarse clustering and obtain several Canopy centers. The K-medoids algorithm is robust to noise but requires the K value to be specified in advance, so the Canopy center points can be used as the initial cluster centers of the K-medoids algorithm. Experiments show that this scheme is feasible. However, neither algorithm can cope with massive amounts of data, so the traditional clustering algorithms are parallelized on big data platforms. First, the two algorithms are combined and parallelized with the MapReduce programming model on the Hadoop platform (the HCKM algorithm). Although this handles massive data quickly to a certain extent, its performance becomes unsatisfactory on practical problems that require many iterations over the data. Second, the RDD Transformation and Action operations of the Spark platform are used to parallelize the two algorithms (the SCKM algorithm), which yields faster convergence and more stable results.

The Canopy-K-medoids algorithm is deployed on both Hadoop clusters and Spark clusters. The traditional K-medoids algorithm, the HCKM algorithm and the SCKM algorithm are tested separately and compared in terms of speedup, accuracy and other measures. Finally, the improved Spark-based Canopy-K-medoids algorithm (the SCKM algorithm) is verified to perform well: it processes massive data that requires many iterations faster and more stably, and the processed data also has better accuracy.
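
The seeding idea described in the abstract, using Canopy centers as the initial medoids of K-medoids, can be illustrated with a small serial sketch. This is not the thesis implementation: the function names `canopy_centers` and `k_medoids`, the single tight threshold `t2`, the Euclidean distance, and the simple alternating assign/update form of K-medoids are all assumptions made for illustration. The loose Canopy threshold T1 is omitted because only the centers are needed here, and no Hadoop or Spark parallelization is shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2, dist=euclidean):
    """Coarse Canopy pass that returns only the canopy centers.

    Repeatedly takes the first remaining point as a center and drops every
    remaining point within the tight threshold t2 of it. The number of
    centers found plays the role of K.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if dist(center, p) > t2]
    return centers


def k_medoids(points, initial_medoids, max_iter=100, dist=euclidean):
    """Alternating K-medoids: assign points to the nearest medoid, then move
    each medoid to the cluster member with the smallest total distance to
    the other members. Stops when the medoids no longer change."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        # Assignment step: group points by nearest current medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            nearest = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step: pick the most central member of each cluster.
        new_medoids = []
        for old, members in zip(medoids, clusters):
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(old)
                continue
            best = min(members, key=lambda c: sum(dist(c, q) for q in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


# Toy data: two well-separated groups on a line (illustrative only).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
seeds = canopy_centers(points, t2=3.0)   # [(0.0, 0.0), (10.0, 0.0)]
medoids = k_medoids(points, seeds)       # [(1.0, 0.0), (11.0, 0.0)]
```

In this toy run the Canopy pass finds two centers, so K is 2 without being specified by hand, and K-medoids then moves each seed to the central point of its group. The choice of `t2` still has to be made by the user and strongly affects how many centers are produced.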
Keywords/Search Tags: Spark, parallelization, clustering, big data