
Parallelizing K-means-based Clustering On Spark

Posted on: 2019-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: B W Wang
Full Text: PDF
GTID: 2428330572455300
Subject: Computer application technology
Abstract/Summary:
Cluster analysis has always been one of the core research topics in data mining. It has not only received long-term, sustained attention from researchers at home and abroad, but has also been applied in many areas of industry. Among the large number of clustering methods, the K-means algorithm has become the most basic and widely used one because of its simplicity, high efficiency, robust clustering results, easy interpretation, and wide applicability. As research has deepened, the K-means family of methods has grown increasingly large, producing many variants, such as Info-Kmeans for high-dimensional sparse text clustering, fuzzy c-means, and consensus clustering. These methods all use the two-stage iterative process of K-means: distance calculation and updating of cluster centers.

The arrival of the big data era makes it difficult for single-machine clustering algorithms to meet the requirements of big data clustering. Therefore, how to use distributed computing technology to improve the scalability of big data clustering has become an important issue. This thesis focuses on how to design and implement a general-purpose distributed clustering framework for the K-means method family. The framework needs two characteristics: first, it must support the mainstream algorithms of the K-means family and easily incorporate new K-means-based algorithms; second, it must cluster large-scale data efficiently, that is, it must be highly scalable. The main research work of this thesis is summarized as follows:

1. Based on the Spark in-memory framework, we propose a general-purpose distributed clustering framework applicable to the K-means method family. We redesigned the various stages of the framework to enhance its adaptability and extensibility, including adapting the data-loading module to high-dimensional sparse data and making the distance-calculation module support multiple K-means distance functions, so that different types of data can be clustered (see the first sketch after this abstract).

2. Fuzzy c-means needs the membership matrix to update the cluster centers, so its computation differs from that of K-means. This thesis designs a computational framework for fuzzy c-means on the Spark platform that is highly adaptable and extensible; for example, the distance-calculation module supports a variety of fuzzy c-means distance functions. The membership-matrix update is placed in the map function of the distance computation, and the advantages of the distributed framework make the algorithm more efficient (see the second sketch after this abstract).

3. We propose a K-means-based consensus clustering algorithm on the Spark platform, covering both hard consensus clustering and fuzzy consensus clustering. K-means-based consensus clustering differs from the K-means algorithm in the initialization and distance-calculation steps, so this thesis designs initialization and distance-calculation methods suited to the characteristics of the Spark distributed computing framework.

4. Experiments with the K-means methods implemented on the Spark in-memory framework demonstrate their clustering quality and execution efficiency on large UCI data sets and text data sets. On KDDCUP, a data set with millions of instances, our clustering efficiency is close to that of CLUTO and faster than MLlib. Our algorithms also achieve good clustering results on high-dimensional text data sets such as the Weibo data set.
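The following is a minimal sketch, not the thesis's actual implementation, of the two-stage K-means iteration on Spark described in item 1: distance calculation in a map step and cluster-center updates in a reduce step. The input path, K, and iteration count are illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance; a framework like the one described above would
  // plug in other distance functions (e.g. cosine for sparse text) here.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closestCenter(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KMeansSketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path and parameters.
    val k = 10
    val maxIter = 20
    val data: RDD[Point] = sc
      .textFile("hdfs:///path/to/points.txt")
      .map(_.split(',').map(_.toDouble))
      .cache()

    var centers = data.takeSample(withReplacement = false, k)

    for (_ <- 1 to maxIter) {
      val bcCenters = sc.broadcast(centers)
      // Stage 1 (map): distance calculation, assign each point to its nearest center.
      // Stage 2 (reduce): sum the points per cluster and recompute the centers.
      val newCenters = data
        .map(p => (closestCenter(p, bcCenters.value), (p, 1L)))
        .reduceByKey { case ((s1, c1), (s2, c2)) =>
          (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2)
        }
        .mapValues { case (sum, count) => sum.map(_ / count) }
        .collectAsMap()
      // Keep the old center if a cluster received no points this iteration.
      centers = centers.indices.map(i => newCenters.getOrElse(i, centers(i))).toArray
      bcCenters.destroy()
    }

    spark.stop()
  }
}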
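The second sketch, under the same hypothetical setup, shows how a fuzzy c-means update (item 2) can be fused into the distance-calculation map step on Spark: each point computes its membership vector locally, so no global membership matrix is materialized, and the fuzzified weighted sums needed for the new centers are aggregated with reduceByKey.

import org.apache.spark.rdd.RDD

object FuzzyCMeansSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One fuzzy c-means iteration; m > 1 is the fuzzifier.
  def updateCenters(data: RDD[Point], centers: Array[Point], m: Double): Array[Point] = {
    val bc = data.sparkContext.broadcast(centers)

    val stats = data.flatMap { p =>
      // Distances from this point to every current center (floored to avoid
      // division by zero when the point coincides with a center).
      val dist = bc.value.map(c => math.max(math.sqrt(sqDist(p, c)), 1e-12))
      // Membership in cluster i: u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)).
      val u = dist.map { di =>
        1.0 / dist.map(dk => math.pow(di / dk, 2.0 / (m - 1.0))).sum
      }
      // Emit (cluster, (u^m * x, u^m)) pairs for the weighted-average update.
      u.indices.map { i =>
        val w = math.pow(u(i), m)
        (i, (p.map(_ * w), w))
      }
    }

    val newCenters = stats
      .reduceByKey { case ((s1, w1), (s2, w2)) =>
        (s1.zip(s2).map { case (x, y) => x + y }, w1 + w2)
      }
      .mapValues { case (sum, w) => sum.map(_ / w) }
      .collectAsMap()

    bc.destroy()
    centers.indices.map(i => newCenters.getOrElse(i, centers(i))).toArray
  }
}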
Keywords/Search Tags: Cluster Ensemble, Big Data, K-means, Fuzzy c-means, Spark