Research On Spark Oriented Fuzzy C-means Clustering Algorithm

Posted on:2016-10-07

Degree:Master

Type:Thesis

Country:China

Candidate:P Liang

Full Text:PDF

GTID:2308330479989726

Subject:Computer Science and Technology

Abstract/Summary:

Massive amounts of data can now be collected through a variety of advanced technologies. In the past, the potential value of these data was ignored for lack of effective techniques, but the data can now be exploited to learn the characteristics of human and social behavior and to achieve greater economic benefit. Clustering is an important part of data mining, and fuzzy clustering refers to a family of methods commonly used in data analysis. This thesis aims to apply fuzzy c-means (FCM) clustering to large-scale data analysis. Through scalable, parallel adaptations of the classical algorithm, the new algorithms are deployed on a modern distributed processing framework to make better use of the resources offered by large clusters of computers, and they combine new approaches to improve the performance of the classical algorithm in distributed scenarios.

The thesis presents the research in the following aspects:

1) Compare clustering methods and classify them according to their characteristics. Select a method that is both a more precise fuzzy clustering method and a partition-based method well suited to deployment and application in big data scenarios, namely the fuzzy c-means clustering algorithm.

2) Study the distributed models of Hadoop and Spark in depth and compare their performance on iterative tasks. Select the distributed framework better suited to iterative clustering algorithms, namely Spark.

3) Implement a parallel fuzzy c-means algorithm using the Spark programming model. Fuzzy c-means initializes by selecting cluster centers at random, which introduces considerable uncertainty into the behavior of the main iteration and the precision of the final results, and this is a significant cost in big data scenarios. The research therefore draws on improved initialization strategies for k-means: it extends k-means∥ to fuzzy c-means to achieve better clustering performance, and develops a parallel and scalable fuzzy c-means algorithm with the Spark programming model.

4) The algorithm proposed in step 3 is suitable for dense, spherical structures, but its clustering results on non-convex structures are poor. Because data structures in big data scenarios are very complex, classical fuzzy c-means, which supports only a single kind of structure, is no longer adequate. To meet the demand for clustering various data structures in big data scenarios, the research introduces a kernel function so that fuzzy c-means achieves better clustering results on both linear and non-linear data, and develops a parallel and scalable kernel fuzzy c-means algorithm with the Spark programming model.

The two algorithms proposed in this thesis show good scalability and parallelism in several experiments on real and synthetic datasets, and they effectively extend fuzzy c-means clustering to distributed applications. They greatly expand the scale of data the algorithm can process, improve its robustness, and make the classical algorithm suitable for a wider variety of data structures. Parallel and scalable clustering algorithms can thus perform cluster analysis on big data more effectively.
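As an illustration of why fuzzy c-means fits a map/reduce style of parallelism, the following is a minimal sketch in plain Python, not the thesis's Spark implementation. It assumes the standard FCM update rules with fuzzifier m, and it uses lists of points as stand-ins for data partitions. The function names (`memberships`, `partial_sums`, `merge`, `fcm`) are hypothetical, and the initial centers are supplied by the caller, so neither the k-means∥-style initialization nor the kernel variant is shown.

```python
import math
from functools import reduce


def euclid(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in every cluster (sums to 1)."""
    dists = [euclid(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def partial_sums(partition, centers, m=2.0):
    """Map step: weighted sums over one partition for the center update."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]  # sum of u^m * x, per cluster
    den = [0.0] * len(centers)            # sum of u^m, per cluster
    for x in partition:
        for j, u in enumerate(memberships(x, centers, m)):
            w = u ** m
            den[j] += w
            for d in range(dim):
                num[j][d] += w * x[d]
    return num, den


def merge(a, b):
    """Reduce step: add two partial results element-wise."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def fcm(partitions, centers, m=2.0, tol=1e-6, max_iter=100):
    """Iterate until no center moves by more than tol (or max_iter is hit)."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        parts = (partial_sums(p, centers, m) for p in partitions)
        num, den = reduce(merge, parts)
        # Keep the old center if a cluster received no weight at all.
        new_centers = [
            [v / s for v in row] if s > 0.0 else old
            for row, s, old in zip(num, den, centers)
        ]
        shift = max(euclid(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

Because each partition contributes only per-cluster sums, the data exchanged per iteration depends on the number of clusters and dimensions rather than on the number of points. On Spark the same two steps would plausibly map onto a per-partition transformation followed by a reduce, although the thesis's actual design may differ.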

Keywords/Search Tags:

fuzzy c-means, cluster analysis, distributed processing, big data, Spark, parallel algorithm

Related items

1	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
2	Research On Parallel K-means Algorithm Based On Genetic Algorithm
3	Parallelizing K-means-based Clustering On Spark
4	Research And Application Of K-means Algorithm Based On Density And Distance
5	The Research On Fuzzy C-Means Cluster Analysis And Its Applications
6	Theoretical And Applied Research On Fuzzy C-means Clustering And Its Cluster Validation
7	The Study And Improvement Of Fuzzy C-means Cluster Algorithm
8	Improved Fuzzy C-Means Clustering Algorithm
9	Research On Parallel Random Forest And Fuzzy C-Means Algorithm For Imbalanced Data
10	Research And Application Of FCM Algorithms Based On Spark