The Research And Implementation Of The Parallelization Of The Clustering Algorithm In Cluster Environment

Posted on:2013-07-10

Degree:Master

Type:Thesis

Country:China

Candidate:J Hu

Full Text:PDF

GTID:2248330374967139

Subject:Computer software and theory

Abstract/Summary:

Cluster analysis is a common technique in data mining. Based on the features of data objects, it assigns a set of objects to clusters so that objects in the same cluster are more similar to each other than to those in other clusters. It has been widely used in many fields, such as market research, image processing, and Internet data analysis. However, the explosive growth of data in these fields makes the computation time-consuming, so it cannot meet the timeliness requirements of data mining. Parallel methods are therefore considered as a way to improve the efficiency and scalability of the algorithms.

K-Means is a widely used clustering algorithm. However, as data size and dimensionality grow, its iterative computation becomes time-consuming. To apply K-Means to the cluster analysis of massive data sets, we parallelize the algorithm so that clustering executes in parallel on multiple computers.

As a message-passing standard, MPI provides a communication interface and supports communication between applications, which gives developers of parallel applications more flexibility and control. Hadoop is an open-source distributed computing framework. It uses the MapReduce programming model proposed by Google, which packages the system modules for parallel execution, communication, task scheduling, and dynamic fault tolerance into an underlying library. It provides developers with a high-level programming interface, so they only need to focus on application logic. Spark is also a cluster data analysis platform; it provides a distributed dataset abstraction (RDD) and a high-level programming interface for building parallel data analysis applications.

In this thesis, we analyze the features of the three computing platforms and use their technologies to implement a parallel K-Means clustering algorithm. We then compare the performance of the parallel algorithms with that of the serial algorithm, and across the three platforms. The experimental results show that the parallel algorithms on all three platforms achieve good speed-up and scale-up when processing large data sets. Comparing the three platforms, the MPI-based parallel algorithm performs better than the others but provides no fault tolerance during the analysis; the Hadoop-based parallel algorithm has lower computational efficiency but robust fault tolerance; the Spark-based algorithm performs close to the MPI-based one and also supports fault tolerance during clustering, so it is suitable for massive-data cluster analysis applications.
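
The data-parallel structure that such a parallel K-Means typically relies on can be illustrated with a minimal single-process sketch. This is not the thesis's implementation: the function names (`map_partition`, `reduce_partials`, `kmeans`), the in-memory partitions, and the convergence tolerance are all assumptions made for illustration. Each partition stands in for the data held by one worker; it computes per-cluster coordinate sums and counts locally, and only those small partial results are combined to update the centroids. On MPI the combine step could plausibly be realized with a collective such as `MPI_Allreduce`, and on Hadoop or Spark with a reduce stage.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Tuple[List[List[float]], List[int]]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_partition(points: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Local step for one partition: per-cluster coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for pt in points:
        k = nearest(pt, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += pt[d]
    return sums, counts


def reduce_partials(partials: Sequence[Partial], centroids: Sequence[Point]) -> List[Point]:
    """Global step: merge the partial sums and counts into new centroids."""
    dim = len(centroids[0])
    updated: List[Point] = []
    for k, old in enumerate(centroids):
        total = sum(counts[k] for _, counts in partials)
        if total == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            updated.append(tuple(old))
            continue
        updated.append(
            tuple(sum(sums[k][d] for sums, _ in partials) / total for d in range(dim))
        )
    return updated


def kmeans(
    partitions: Sequence[Sequence[Point]],
    centroids: Sequence[Point],
    max_iter: int = 20,
    tol: float = 1e-9,
) -> List[Point]:
    """Iterate the local and global steps until the centroids stop moving."""
    current = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # In a real cluster each call below would run on a different worker.
        partials = [map_partition(part, current) for part in partitions]
        updated = reduce_partials(partials, current)
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(current, updated)
        )
        current = updated
        if shift <= tol:
            break
    return current
```

For example, calling `kmeans` with two partitions of two 2-D points each and two initial centroids returns one centroid per cluster, computed from sums that were accumulated separately in each partition. The per-iteration communication cost depends on the number of clusters and dimensions rather than on the number of points, which is one reason this decomposition tends to suit large data sets.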

Keywords/Search Tags:

Cluster Analysis, Clusters Environment, Parallelization, K-Means

Related items

1	An Advanced Partition Clustering And Parallelization On Cluster Environment
2	Research And Application On Determining Optimal Number Of Clusters In Cluster Analysis
3	Research On Determining Optimal Number Of Clusters In Cluster Analysis
4	Improvements And Implementation Of K-means Clustering Algorithm
5	Design And Implementation Of Distributed Text Clustering System Based On K-means
6	Application Research Of Improved K-means Algorithm In Big Data Clustering
7	Research On Parallel K-means Algorithm Based On Genetic Algorithm
8	Some Problems Of Determining The Optimal Number Of Clusters In Clustering Analysis
9	Research On Data Mining Algorithm Based On Marine Environment
10	Number Of Clusters In Cluster Analysis To Determine The Problem