Research On Spark Oriented Fuzzy C-means Clustering Algorithm

Posted on: 2016-10-07  Degree: Master  Type: Thesis
Country: China  Candidate: P Liang  Full Text: PDF
GTID: 2308330479989726  Subject: Computer Science and Technology
Abstract/Summary:
Massive amounts of data can now be collected through a variety of advanced technologies. In the past, the potential value of these data was ignored for lack of effective technologies; today the data can be exploited to learn the characteristics of human and social behavior and to achieve greater economic benefit. Clustering is an important part of data mining, and fuzzy clustering is a family of methods widely used in data analysis. This thesis aims to apply fuzzy c-means (FCM) clustering to large-scale data analysis. Through scalable, parallel adaptations of the classical algorithm, the new algorithms are deployed on a modern distributed processing framework to make better use of the resources offered by large clusters of computers, and they combine new approaches to improve the performance of the classical algorithm in distributed scenarios.

The thesis presents the research in the following aspects:

1) Clustering methods are compared and classified according to their characteristics. Fuzzy c-means is chosen because it is a more precise, fuzzy method and, being partition-based, is better suited to deployment and application in big data scenarios.

2) After an in-depth study of their distributed models, Hadoop and Spark are compared in terms of performance on iterative tasks. Spark is chosen as the distributed framework better suited to iterative clustering algorithms.

3) A parallel fuzzy c-means algorithm is implemented with the Spark programming model. However, fuzzy c-means initializes by selecting cluster centers at random, which makes the behavior of the main iteration and the precision of the final results uncertain, and this is costly in big data scenarios. The research therefore draws on improved initialization strategies for k-means: it extends k-means|| to fuzzy c-means to achieve better clustering performance, and develops a parallel and scalable fuzzy c-means with the Spark programming model.

4) The algorithm proposed in step 3 is suitable for dense spherical structures, but it clusters non-convex structures poorly. Because data structures in big data scenarios are very complicated, classical fuzzy c-means, which supports only a single kind of structure, is no longer adequate. To meet the clustering demands of varied data structures, the research introduces a kernel function so that fuzzy c-means achieves better clustering results on both linear and non-linear data, and a parallel and scalable kernel fuzzy c-means algorithm is likewise developed with the Spark programming model.

The two algorithms proposed in this thesis show good scalability and parallelism in several experiments on real and synthetic datasets, and effectively extend fuzzy c-means clustering to distributed applications. They greatly expand the scale of data the algorithm can process, improve its robustness, and make the classical algorithm suitable for more varied data structures. Parallel and scalable clustering algorithms can thus perform clustering analysis on big data more effectively.
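The way a fuzzy c-means iteration splits into per-partition work can be sketched as follows. This is a minimal pure-Python illustration of the standard FCM update, not the thesis's Spark implementation: the function names (`memberships`, `partial_sums`, `merge`, `fcm`), the fuzzifier default `m=2.0`, and the stopping tolerance are all assumptions made for the example. Each partition computes weighted sums independently and the results are combined with an associative merge, which is the pattern a Spark job would typically express with `mapPartitions` followed by `reduce`.

```python
from functools import reduce


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in every cluster (sums to 1)."""
    d = [sq_dist(point, c) for c in centers]
    zeros = [i for i, v in enumerate(d) if v == 0.0]
    if zeros:
        # A point lying exactly on one or more centers belongs only to them.
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    # Distances are squared, so the usual exponent 2/(m-1) becomes 1/(m-1).
    power = 1.0 / (m - 1.0)
    inv = [(1.0 / v) ** power for v in d]
    total = sum(inv)
    return [v / total for v in inv]


def partial_sums(partition, centers, m=2.0):
    """'Map' step: numerators and denominators of the center update."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]
    den = [0.0] * len(centers)
    for p in partition:
        for j, u in enumerate(memberships(p, centers, m)):
            w = u ** m
            den[j] += w
            for k in range(dim):
                num[j][k] += w * p[k]
    return num, den


def merge(a, b):
    """'Reduce' step: element-wise sum of two partial results."""
    num = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [x + y for x, y in zip(a[1], b[1])]
    return num, den


def fcm(partitions, centers, m=2.0, iters=50, tol=1e-6):
    """Iterate the center update until the centers stop moving."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        num, den = reduce(merge, (partial_sums(p, centers, m)
                                  for p in partitions))
        # Keep the old center if a cluster received no weight at all.
        new = [[v / d for v in row] if d > 0.0 else old
               for row, d, old in zip(num, den, centers)]
        shift = max(sq_dist(a, b) for a, b in zip(centers, new))
        centers = new
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    # Two hypothetical partitions holding points from two separated groups.
    parts = [[(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
             [(1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]]
    print(fcm(parts, [(1.0, 1.0), (9.0, 9.0)]))
```

Because only the small `(num, den)` pair leaves each partition, the data points themselves never need to be moved between workers. The initial centers are passed in explicitly here; a seeding scheme such as the k-means|| extension described above would replace that argument.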
Keywords: fuzzy c-means, cluster analysis, distributed processing, big data, Spark, parallel algorithm