
Parallelizing K-means-based Clustering On Spark

Posted on: 2019-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: B W Wang
Full Text: PDF
GTID: 2428330572455300
Subject: Computer application technology
Abstract/Summary:
Cluster analysis has always been one of the core research topics in data mining. It has not only received long-term, sustained attention from researchers at home and abroad, but has also been applied in many areas of industry. Among the large number of clustering methods, the K-means algorithm has become the most basic and widely used one because of its simplicity, high efficiency, robust clustering results, easy interpretation, and wide applicability. As research has deepened, the K-means family of methods has grown increasingly large, producing many variants, such as Info-Kmeans for high-dimensional sparse text clustering, fuzzy c-means, and consensus clustering. These methods all use the two-stage iterative process of K-means: distance calculation and updating of cluster centers.

The arrival of the big data era makes it difficult for single-machine clustering algorithms to meet the requirements of big data clustering. Therefore, how to use distributed computing technology to improve the scalability of big data clustering has become an important issue. This thesis focuses on how to design and implement a general-purpose distributed clustering framework for the K-means method family. The framework needs two characteristics: first, it must support the mainstream algorithms of the K-means family and easily incorporate new K-means-based algorithms; second, it must cluster large-scale data efficiently, that is, it must be highly scalable. The main research work of this thesis is summarized as follows:

1. Based on the Spark in-memory framework, we propose a general-purpose distributed clustering framework applicable to the K-means method family. We redesigned the various stages of the framework to enhance its adaptability and extensibility, including adapting the data-loading module to high-dimensional sparse data and making the distance-calculation module support multiple K-means distance functions, so that different types of data can be clustered (see the first sketch after this abstract).

2. Fuzzy c-means needs the membership matrix to update the cluster centers, so its computation differs from that of K-means. This thesis designs a computational framework for fuzzy c-means on the Spark platform that is highly adaptable and extensible; for example, the distance-calculation module supports a variety of fuzzy c-means distance functions. The membership-matrix update is placed in the map function of the distance computation, and the advantages of the distributed framework make the algorithm more efficient (see the second sketch after this abstract).

3. We propose a K-means-based consensus clustering algorithm on the Spark platform, covering both hard consensus clustering and fuzzy consensus clustering. K-means-based consensus clustering differs from the K-means algorithm in the initialization and distance-calculation steps, so this thesis designs initialization and distance-calculation methods suited to the characteristics of the Spark distributed computing framework.

4. Experiments with the K-means methods implemented on the Spark in-memory framework demonstrate their clustering quality and execution efficiency on large UCI data sets and text data sets. On KDDCUP, a data set with millions of instances, our clustering efficiency is close to that of CLUTO and faster than MLlib. Our algorithms also achieve good clustering results on high-dimensional text data sets such as the Weibo data set.
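The following is a minimal sketch, not the thesis's actual implementation, of the two-stage K-means iteration on Spark described in item 1: distance calculation in a map step and cluster-center updates in a reduce step. The input path, K, and iteration count are illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance; a framework like the one described above would
  // plug in other distance functions (e.g. cosine for sparse text) here.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closestCenter(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KMeansSketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path and parameters.
    val k = 10
    val maxIter = 20
    val data: RDD[Point] = sc
      .textFile("hdfs:///path/to/points.txt")
      .map(_.split(',').map(_.toDouble))
      .cache()

    var centers = data.takeSample(withReplacement = false, k)

    for (_ <- 1 to maxIter) {
      val bcCenters = sc.broadcast(centers)
      // Stage 1 (map): distance calculation, assign each point to its nearest center.
      // Stage 2 (reduce): sum the points per cluster and recompute the centers.
      val newCenters = data
        .map(p => (closestCenter(p, bcCenters.value), (p, 1L)))
        .reduceByKey { case ((s1, c1), (s2, c2)) =>
          (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2)
        }
        .mapValues { case (sum, count) => sum.map(_ / count) }
        .collectAsMap()
      // Keep the old center if a cluster received no points this iteration.
      centers = centers.indices.map(i => newCenters.getOrElse(i, centers(i))).toArray
      bcCenters.destroy()
    }

    spark.stop()
  }
}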
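The second sketch, under the same hypothetical setup, shows how a fuzzy c-means update (item 2) can be fused into the distance-calculation map step on Spark: each point computes its membership vector locally, so no global membership matrix is materialized, and the fuzzified weighted sums needed for the new centers are aggregated with reduceByKey.

import org.apache.spark.rdd.RDD

object FuzzyCMeansSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One fuzzy c-means iteration; m > 1 is the fuzzifier.
  def updateCenters(data: RDD[Point], centers: Array[Point], m: Double): Array[Point] = {
    val bc = data.sparkContext.broadcast(centers)

    val stats = data.flatMap { p =>
      // Distances from this point to every current center (floored to avoid
      // division by zero when the point coincides with a center).
      val dist = bc.value.map(c => math.max(math.sqrt(sqDist(p, c)), 1e-12))
      // Membership in cluster i: u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)).
      val u = dist.map { di =>
        1.0 / dist.map(dk => math.pow(di / dk, 2.0 / (m - 1.0))).sum
      }
      // Emit (cluster, (u^m * x, u^m)) pairs for the weighted-average update.
      u.indices.map { i =>
        val w = math.pow(u(i), m)
        (i, (p.map(_ * w), w))
      }
    }

    val newCenters = stats
      .reduceByKey { case ((s1, w1), (s2, w2)) =>
        (s1.zip(s2).map { case (x, y) => x + y }, w1 + w2)
      }
      .mapValues { case (sum, w) => sum.map(_ / w) }
      .collectAsMap()

    bc.destroy()
    centers.indices.map(i => newCenters.getOrElse(i, centers(i))).toArray
  }
}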
Keywords/Search Tags: Cluster Ensemble, Big Data, K-means, Fuzzy c-means, Spark