Research On K-medoids Clustering Algorithm Based On Spark

Posted on:2019-02-12

Degree:Master

Type:Thesis

Country:China

Candidate:Z J Zang

Full Text:PDF

GTID:2348330548462252

Subject:Computer technology

Abstract/Summary:

With the deep integration of the Internet and traditional industries, data is growing explosively. In an age when data is king, people are increasingly aware of how important the information contained in massive data is in guiding work and life. Rapidly discovering useful knowledge in such data and using it to guide life and production is therefore a valuable research topic. Traditional clustering algorithms cluster well on relatively small data sets, but a traditional serial algorithm cannot finish quickly on massive data, and its running time grows as the data volume increases.

The main task of this thesis is to study the advantages and disadvantages of the traditional Canopy and K-medoids algorithms. The Canopy algorithm performs a fast coarse clustering and quickly yields several Canopy centers. The K-medoids algorithm is robust to noise but requires the value of K to be specified in advance, so the Canopy centers can be used as the initial cluster centers of the K-medoids algorithm. Experiments show that this scheme is feasible. However, neither algorithm can cope with massive amounts of data, so the traditional clustering algorithms are parallelized on big data platforms. First, the two algorithms are combined and parallelized with the MapReduce programming model on the Hadoop platform (the HCKM algorithm). Although this handles massive data quickly to a certain extent, its performance becomes unsatisfactory on practical problems that require many iterations over the data. Second, the two algorithms are parallelized with RDD Transformation and Action operations on the Spark platform (the SCKM algorithm), which improves convergence speed and the stability of the results.

This thesis deploys the Canopy-K-medoids algorithm on Hadoop and Spark clusters. The traditional K-medoids algorithm, the HCKM algorithm, and the SCKM algorithm are each tested and compared in terms of speedup, accuracy, and other measures. Finally, the improved Spark-based Canopy-K-medoids algorithm (SCKM) is shown to perform well: it processes massive data that requires many iterations faster and more stably, and its results are also more accurate.
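
The idea of seeding K-medoids with Canopy centers can be illustrated with a small serial sketch. This is not the thesis's HCKM or SCKM implementation: the function names, the single tight threshold `t2`, the Euclidean distance, and the sample points are all assumptions made for illustration, and the medoid update uses the simple alternating (assign, then re-pick each cluster's medoid) variant rather than the full PAM swap search.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_centers(points, t2):
    """Coarse Canopy pass that returns only the canopy centers.

    t2 is a hypothetical 'tight' threshold: every point within t2 of a
    chosen center is dropped from the candidate list. The usual 'loose'
    threshold is omitted because only the centers are needed here.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)      # take the next candidate as a center
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers

def assign(points, medoids):
    """Assign every point to its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: dist(p, m))
        clusters[nearest].append(p)
    return clusters

def k_medoids(points, initial_medoids, max_iter=100):
    """Alternating K-medoids, started from the given initial medoids."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        # New medoid of a cluster: the member with the smallest total
        # distance to all other members of that cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, q) for q in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(medoids):
            break                      # medoids stopped changing
        medoids = new_medoids
    return medoids, assign(points, medoids)

# Illustrative data only: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

centers = canopy_centers(points, t2=3.0)   # K is implied by len(centers)
medoids, clusters = k_medoids(points, centers)
print(len(centers), sorted(medoids))
```

In this sketch the number of Canopy centers fixes K, so K no longer has to be chosen by hand. The assignment step is independent for each point and the medoid update is independent for each cluster, which is the kind of structure that lends itself to map-style and per-key parallelization on platforms such as Hadoop or Spark.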

Keywords/Search Tags:

Spark, parallelization, clustering, big data

Related items

1	Research And Application Of Big Data Clustering Algorithm Based On Spark Platform
2	Research On K-medoids Clustering Algorithm Based On Spark
3	Research On Parallelization Of Data Stream Clustering Algorithm For Police Data
4	The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform
5	The Parallelization And Optimization Of K-means Algorithm Based On Spark
6	Research On Cluster Analysis Technology Of Component Size Measurement Data Based On Spark
7	Research And Implementation Of Classification Algorithm Parallelization Based On Spark
8	Research And Implementation Of Large-Scale And Efficient Clustering Algorithm Based On Spark
9	The Optimization Of Clustering And Classification Algorithms Based On SPARK
10	Implementation And Application Of Clustering Algorithm Based On Spark