The Research Of Parallel Clustering Algorithm Of Massive Data In Cloud Computing Environment

Posted on:2015-03-12

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y J Xu

Full Text:PDF

GTID:1228330461477056

Subject:Computer application technology

Abstract/Summary:

Clustering is one of the most fundamental techniques in data analysis and management and has been applied in many areas of computer science and related fields. However, the emergence of massive data poses challenges for traditional clustering algorithms, such as poor scalability and low efficiency. Cloud computing technology, represented by MapReduce, has attracted increasing attention from both industry and academia, and MapReduce has become one of the most popular models for processing massive data. This dissertation studies parallel clustering algorithms for massive data in a cloud computing environment. The research focuses on the k-means, k-means++ and scalable k-means++ clustering algorithms implemented with MapReduce, with the goal of improving their scalability and efficiency. The main work and results are summarized as follows.

The k-means++ initialization method scales poorly because it is inherently sequential and requires too many iterative MapReduce jobs. To address this, this dissertation proposes a parallel and scalable k-means++ algorithm whose initialization method takes only one MapReduce job to choose k centers. The Map phase runs the standard k-means++ initialization algorithm, and the Reduce phase runs the weighted k-means++ initialization algorithm. This method improves the efficiency of k-means++ on massive data and is proved to be an O(α²) approximation to the optimal k-means clustering, where α = 8(2 + ln k).

The initialization method of the scalable k-means++ algorithm still has to launch two MapReduce jobs in each iteration. By applying the oversampling technique in the Map phase and the refining technique in the Reduce phase, this dissertation proposes a fast scalable k-means++ algorithm in which each iteration of the initialization method requires only one MapReduce job. This saves considerable I/O cost and time and greatly improves the efficiency of the scalable k-means++ algorithm.

When the MapReduce k-means algorithm processes massive skewed data, the workload of the Reduce tasks is unbalanced. This causes large differences in the running times of the Reduce tasks, increases the overall processing time and degrades the resource utilization of the cloud computing platform. To address this, this dissertation proposes a data partitioning method based on sampling and estimating. It applies sampling and estimation theory to the whole data set and produces a good data partitioning scheme with the C2 or CSC method. This scheme is then applied to the MapReduce k-means algorithm. Experimental results show that the method balances the workload of the Reduce tasks and reduces the running time of the MapReduce k-means algorithm.
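The single-job initialization described in the abstract (standard k-means++ in the Map phase, weighted k-means++ in the Reduce phase) can be illustrated with a small in-memory sketch. This is not the dissertation's implementation: the function names (map_split, reduce_centers, one_job_init) are hypothetical, the Map and Reduce phases are simulated with plain Python functions instead of a real MapReduce framework, and weighting each local center by the number of split points nearest to it is an assumption, since the abstract does not say how the weights are defined.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeanspp(points, weights, k, rng):
    """Weighted k-means++ seeding: sample centers with probability
    proportional to weight * squared distance to the nearest chosen center."""
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    # d2[i] holds the squared distance from points[i] to its nearest center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        scores = [w * d for w, d in zip(weights, d2)]
        if sum(scores) == 0:  # every remaining point coincides with a center
            break
        c = rng.choices(points, weights=scores, k=1)[0]
        centers.append(c)
        d2 = [min(d, sq_dist(p, c)) for p, d in zip(points, d2)]
    return centers


def map_split(split, k, rng):
    """Simulated Map task: standard (unit-weight) k-means++ on one split.
    Emits (center, weight) pairs; the weight is the number of split points
    nearest to that center (an assumption of this sketch)."""
    centers = weighted_kmeanspp(split, [1] * len(split), k, rng)
    counts = [0] * len(centers)
    for p in split:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        counts[nearest] += 1
    return list(zip(centers, counts))


def reduce_centers(emitted, k, rng):
    """Simulated Reduce task: weighted k-means++ over all emitted centers."""
    points = [c for c, _ in emitted]
    weights = [w for _, w in emitted]
    return weighted_kmeanspp(points, weights, k, rng)


def one_job_init(splits, k, seed=0):
    """Choose k initial centers with one simulated Map/Reduce round."""
    rng = random.Random(seed)
    emitted = [pair for split in splits for pair in map_split(split, k, rng)]
    return reduce_centers(emitted, k, rng)


if __name__ == "__main__":
    # Two toy input splits standing in for distributed data blocks
    splits = [
        [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)],
        [(9.0, 0.0), (9.1, 0.3), (0.2, 0.1), (5.1, 5.3)],
    ]
    print(one_job_init(splits, k=3))
```

In this sketch each Map task reads its split once and emits at most k weighted candidates, so the Reduce task only has to seed over a small candidate set. That is the intuition behind needing a single job rather than one job per chosen center.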

Keywords/Search Tags:

Cloud Computing, Massive Data, MapReduce, Clustering Algorithm, Data Skew

Related items

1	Research Of Join Algorithm With Skew Data On Mapreduce
2	The Research Of Handling Data Skew In MapReduce Computing Model
3	Research On The Clustering Algorithm Of Parallel Partition Based On MapReduce
4	Research And Implementation Of Local Priority Scheduling Algorithm Based On Mapreduce For Massive Data
5	Performance Optimization And Applications Of MapReduce In Cloud Computing
6	The Research Of Scheduling Algorithms For Performance And Energy Consumption Under The Condition Of Data Skew
7	Research, Design And Application Of Clustering Algorithm Using Mapreduce
8	Research On Optimal Reduce Placement Algorithm Based On Data Skew
9	The Research And Implementation Of Comprehensive Mapreduce
10	Performance-Aware Scheduling For Data-Intensive Cloud Computing