Research on Parallel Clustering Algorithms for Massive Data in a Cloud Computing Environment

Posted on: 2015-03-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Xu
Full Text: PDF
GTID: 1228330461477056
Subject: Computer application technology
Abstract/Summary:
Clustering is one of the most fundamental techniques in data analysis and management and has been applied in many areas of computer science and related fields. However, the emergence of massive data poses challenges for traditional clustering algorithms, such as poor scalability and low efficiency. Cloud computing technology, represented by MapReduce, has attracted increasing attention from both industry and academia, and MapReduce has become one of the most popular models for processing massive data. This dissertation studies parallel clustering algorithms for massive data in a cloud computing environment. The research focuses on MapReduce implementations of the k-means, k-means++ and scalable k-means++ clustering algorithms, with the goal of improving their scalability and efficiency. The main work and results are summarized as follows.

The k-means++ initialization method scales poorly because it is inherently sequential and, on MapReduce, requires too many iterative jobs. To address this, this dissertation proposes a parallel and scalable k-means++ algorithm whose initialization method needs only one MapReduce job to choose the k centers. The Map phase runs the standard k-means++ initialization algorithm, and the Reduce phase runs a weighted k-means++ initialization algorithm. This method improves the efficiency of k-means++ on massive data and is also proved to be an O(a²) approximation to the optimal k-means clustering result, where a = 8(2 + ln k).

The initialization method of the scalable k-means++ algorithm still has to launch two MapReduce jobs in each iteration. By placing the oversampling technique in the Map phase and the refining technique in the Reduce phase, this dissertation proposes a fast scalable k-means++ algorithm in which each iteration of the initialization method requires only one MapReduce job. This saves considerable I/O cost and time and greatly improves the efficiency of the scalable k-means++ algorithm.

When the MapReduce k-means algorithm processes massive skewed data, the workload of the Reduce tasks is unbalanced. This causes large differences in the running times of the Reduce tasks, increases the overall processing time and degrades the resource utilization of the cloud computing platform. To address this, this dissertation proposes a data partitioning method based on sampling and estimation. It applies sampling and estimation theory to the whole data set and derives a good data partition scheme with the C2 or CSC method. This scheme is then applied to the MapReduce k-means algorithm. Experimental results show that the method balances the workload of the Reduce tasks and reduces the running time of the MapReduce k-means algorithm.
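The one-job k-means++ initialization described in the abstract (standard k-means++ in the Map phase, weighted k-means++ in the Reduce phase) might be sketched roughly as follows. This is a minimal illustration and not the dissertation's implementation: the job is simulated in a single Python process rather than on a MapReduce platform, all function names are hypothetical, and the way mappers assign weights (the number of split points nearest to each local center) is an assumption, since the abstract does not say how the weights are computed.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeanspp(points, weights, k, rng):
    """Weighted k-means++ seeding: D^2 sampling scaled by point weights.

    With all weights equal to 1 this reduces to standard k-means++ seeding.
    """
    # First center: sampled proportionally to weight.
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:  # every remaining point coincides with a center
            break
        c = rng.choices(points, weights=probs, k=1)[0]
        centers.append(c)
        # Keep, for each point, the squared distance to its nearest center.
        d2 = [min(d, sq_dist(p, c)) for p, d in zip(points, d2)]
    return centers


def map_phase(split, k, rng):
    """Mapper: standard k-means++ on one split, emit (center, weight) pairs.

    Assumption: weight = number of split points nearest to that center.
    """
    centers = weighted_kmeanspp(split, [1.0] * len(split), k, rng)
    counts = [0] * len(centers)
    for p in split:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        counts[nearest] += 1
    return list(zip(centers, counts))


def reduce_phase(emitted, k, rng):
    """Reducer: weighted k-means++ over all centers emitted by the mappers."""
    points = [c for c, _ in emitted]
    weights = [w for _, w in emitted]
    return weighted_kmeanspp(points, weights, k, rng)


def one_job_init(splits, k, seed=0):
    """Simulate the single job: every split is mapped once, then reduced once."""
    rng = random.Random(seed)
    emitted = [pair for split in splits for pair in map_phase(split, k, rng)]
    return reduce_phase(emitted, k, rng)


if __name__ == "__main__":
    # Two toy input splits holding points from three loose groups.
    splits = [
        [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)],
        [(9.0, 0.0), (9.2, 0.1), (0.2, 0.1), (5.0, 5.2)],
    ]
    print(one_job_init(splits, k=3))
```

In this sketch the reducer only sees at most k weighted candidates per split, so the data shuffled between phases stays small regardless of the input size, which is the property that lets the initialization finish in a single job.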
Keywords/Search Tags: Cloud Computing, Massive Data, MapReduce, Clustering Algorithm, Data Skew