
The Research And Implementation Of The Parallelization Of The Clustering Algorithm In Cluster Environment

Posted on: 2013-07-10
Degree: Master
Type: Thesis
Country: China
Candidate: J Hu
GTID: 2248330374967139
Subject: Computer software and theory
Abstract/Summary:
Cluster analysis is a common technique in data mining. Based on the features of the data objects, it assigns a set of objects to clusters so that objects in the same cluster are more similar to each other than to those in other clusters. It has been widely used in many fields, such as market research, image processing, and Internet data analysis. However, the explosive growth of data in these fields makes the computation time-consuming, so it cannot meet the timeliness requirements of data mining. Parallel methods are therefore considered as a way to improve the efficiency and scalability of the algorithms.
K-Means is a clustering algorithm widely used in many fields. However, as data size and dimensionality grow, its iterative computation becomes time-consuming. To apply K-Means to the cluster analysis of massive data sets, we aim to parallelize the algorithm so that clustering runs in parallel on multiple computers.
As a standard message-passing library, MPI provides a system-level communication interface and supports communication between applications, which gives developers of parallel applications more flexibility and control. Hadoop is an open-source distributed computing framework. It uses the MapReduce programming model proposed by Google, which packages the system modules for parallel execution, communication, task scheduling, and dynamic fault tolerance into an underlying library. It provides developers with a high-level programming interface, so they only need to focus on application logic.
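To make the idea of parallelizing K-Means concrete, the following is a minimal single-process sketch of one K-Means iteration split into a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). It is an illustration only, not the implementation used in the thesis: the function names (`map_assign`, `reduce_update`, `kmeans_iteration`) and the partition layout are assumptions, and the partitions are processed in a plain loop where a real platform would use separate workers.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(points: Sequence[Point],
               centroids: Sequence[Point]) -> List[Tuple[int, Point]]:
    """Map step (hypothetical name): emit (nearest centroid index, point)."""
    pairs = []
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        pairs.append((nearest, p))
    return pairs


def reduce_update(pairs: Sequence[Tuple[int, Point]],
                  centroids: Sequence[Point]) -> List[Point]:
    """Reduce step (hypothetical name): average the points of each cluster.

    A centroid with no assigned points is kept unchanged.
    """
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = defaultdict(int)
    for idx, p in pairs:
        if idx not in sums:
            sums[idx] = list(p)
        else:
            for d, value in enumerate(p):
                sums[idx][d] += value
        counts[idx] += 1
    new_centroids = []
    for i, old in enumerate(centroids):
        if counts[i]:
            new_centroids.append(tuple(s / counts[i] for s in sums[i]))
        else:
            new_centroids.append(tuple(old))
    return new_centroids


def kmeans_iteration(partitions: Sequence[Sequence[Point]],
                     centroids: Sequence[Point]) -> List[Point]:
    """One iteration: map over each partition, then reduce all emitted pairs."""
    pairs: List[Tuple[int, Point]] = []
    for part in partitions:  # on a cluster, each partition would go to a worker
        pairs.extend(map_assign(part, centroids))
    return reduce_update(pairs, centroids)


if __name__ == "__main__":
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_iteration(partitions, [(1.0, 1.0), (9.0, 9.0)]))
```

Repeating `kmeans_iteration` until the centroids stop changing gives the full algorithm. The map step only needs the current centroids and its own partition, which is why the assignment work can be distributed across machines.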
Spark is also a cluster data analysis platform; it provides a distributed dataset abstraction (RDD) and a high-level programming interface for building parallel data analysis applications.
In this thesis, we analyze the features of the three computing platforms and use each of them to implement a parallel K-Means clustering algorithm. We then compare the performance of the parallel algorithms with the serial algorithm and across the three platforms. The experimental results show that the parallel algorithms on all three platforms achieve good speed-up and scale-up when dealing with large data sets. Comparing efficiency across the three platforms, the MPI-based parallel algorithm performs better than the others but provides no fault tolerance during the analysis process; the Hadoop-based parallel algorithm has lower computational efficiency but robust fault tolerance; the Spark-based algorithm performs close to the MPI-based algorithm and also supports fault tolerance during clustering, so it is suitable for cluster analysis of massive data.
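For reference, speed-up and scale-up are commonly defined as below. This is a sketch of the usual definitions, assuming $T(p, D)$ denotes the running time on $p$ nodes for a data set of size $D$; the abstract does not state the exact definitions used in the thesis.

```latex
\[
\mathrm{Speedup}(p) = \frac{T(1, D)}{T(p, D)},
\qquad
\mathrm{Scaleup}(p) = \frac{T(1, D)}{T(p, p \cdot D)}
\]
```

Under these definitions, ideal speed-up is $p$ (the same data is processed $p$ times faster), and ideal scale-up is $1$ (a data set $p$ times larger takes the same time on $p$ nodes).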
Keywords/Search Tags: Cluster Analysis, Cluster Environment, Parallelization, K-Means