
The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform

Posted on: 2020-01-24
Degree: Master
Type: Thesis
Country: China
Candidate: X X Li
GTID: 2518305741980449
Subject: Master of Engineering
Abstract/Summary:
The rapid development of Internet technology and the advent of the 5G era have produced large volumes of data that need to be processed, and a great deal of valuable information is hidden in this data. More and more companies and scholars are studying how to extract useful information from it. To address this problem, data is processed in parallel on computing clusters, which significantly speeds up the processing of large datasets. Clustering is one of the most commonly used classes of algorithms in data processing, and the parallel clustering algorithms in the Spark platform can solve clustering problems in a big data environment. However, the Spark platform provides only four core clustering algorithms, which cannot fully cope with increasingly complex clustering scenarios, so new clustering algorithms need to be developed for it. The Canopy algorithm and the FCM (Fuzzy C-means) algorithm are often used in clustering scenarios, but the traditional versions of both run serially on a single machine and therefore struggle to handle the data volumes of a big data environment. This thesis therefore studies the parallel design and implementation of the Canopy and FCM algorithms on the Spark platform.

First, parallel versions of the Canopy and FCM algorithms were designed for the Spark platform. The design takes full advantage of the characteristics of DataFrame in the Spark distributed platform and takes into account factors including memory optimization, data compression, and I/O communication cost, with the aim of greatly improving computing capacity on massive data. Second, because the FCM membership matrix consumes a large amount of I/O communication time when the data volume is very large, a distributed algorithm based on an improved membership matrix is proposed, which removes the bottleneck of excessive I/O communication time. Finally, because the FCM algorithm suffers from an uncertain initial K value and unstable initial cluster centers, the Canopy coarse clustering algorithm is used to provide the K value and the initial cluster centers for the FCM algorithm, which improves the usability of FCM.

The experimental results for the parallel Canopy and FCM clustering algorithms on the Spark platform support the following conclusions: (1) The parallel Canopy clustering algorithm was successfully developed and parallelized on a Spark cluster, and it has good scalability. (2) The parallel FCM clustering algorithm was successfully developed, and the large communication time overhead of distributed FCM was resolved by the improved membership matrix. (3) The combined Canopy+FCM algorithm successfully resolved the K value uncertainty and the instability of the initial cluster centers, making the FCM algorithm more stable.
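The Canopy+FCM combination described above can be illustrated with a minimal, serial, single-machine sketch in Python. It is not the thesis's Spark implementation and does not reproduce its improved membership matrix; it only shows how a Canopy pass can supply the number of clusters K and the initial centers to a textbook FCM loop. The function names (`canopy_centers`, `memberships`, `fcm`), the threshold `t2`, the fuzzifier `m = 2`, and the toy data are all assumptions made for illustration. Because only the canopy centers are needed here, the looser Canopy threshold (usually called T1) is omitted.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Coarse Canopy pass that returns only the canopy centers.

    Each remaining point in turn becomes a center, and every point
    within the tight threshold t2 of it is removed from the candidates.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def memberships(points, centers, m):
    """Standard FCM membership update: each row of the result sums to 1."""
    u = []
    for p in points:
        # Clamp distances so a point sitting exactly on a center is safe.
        inv = [max(dist(p, c), 1e-12) ** (-2.0 / (m - 1.0)) for c in centers]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u


def fcm(points, centers, m=2.0, max_iter=100, tol=1e-6):
    """Textbook FCM started from the given centers (here: Canopy output)."""
    dims = len(points[0])
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        u = memberships(points, centers, m)
        new_centers = []
        for j in range(len(centers)):
            w = [row[j] ** m for row in u]  # fuzzified weights for cluster j
            total = sum(w)
            new_centers.append(tuple(
                sum(wi * p[k] for wi, p in zip(w, points)) / total
                for k in range(dims)))
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
seeds = canopy_centers(points, t2=2.0)  # K is simply len(seeds)
centers, u = fcm(points, seeds)
```

In this sketch K is never chosen by hand: it is the number of centers the Canopy pass returns, so the result depends on the choice of `t2`. A distributed version would have to partition the points and aggregate the per-cluster weighted sums across workers, which is where the communication cost discussed in the abstract arises.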
Keywords/Search Tags:Canopy, FCM, Distributed computing, Parallelization, Spark