
The Parallelization And Optimization Of K-means Algorithm Based On Spark

Posted on: 2016-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhang
GTID: 2348330479454697
Subject: Computer technology
Abstract/Summary:
The mobile Internet wave has generated massive amounts of data. These data carry considerable business value and practical significance, and mining useful information from such large, disordered data sets has become an important research topic. To increase massive-data processing capability, a cluster that pools computing resources can be used to carry out data mining tasks, and parallelized mining algorithms running on a distributed computing platform can handle these tasks effectively. This research studies the parallel implementation of K-means on Spark. On the one hand, because K-means requires the K value to be fixed in advance and selects its initial cluster centers at random, this research uses the Canopy algorithm as a pre-clustering step to determine the K value and the initial cluster centers for K-means, improving convergence speed and the stability of the clustering results. On the other hand, to make full use of Spark's RDD features, this research tunes Spark in terms of memory optimization, data compression, data serialization, executor-memory ratio, executor-shuffle ratio, cache size, and so on.
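The tuning aspects listed above can be illustrated with a small configuration sketch. The abstract does not name concrete Spark properties or values, so the mapping below is an assumption: it uses standard Spark 1.x property names (`spark.storage.memoryFraction` and `spark.shuffle.memoryFraction` are the legacy settings that Spark 1.6 and later replace with `spark.memory.fraction` and `spark.memory.storageFraction`), and every value is a placeholder rather than a setting reported in the thesis. The helper `spark_submit_args` is hypothetical and only renders `--conf key=value` arguments for `spark-submit`.

```python
# Illustrative only: property names are real Spark 1.x settings, but the
# mapping to the thesis's tuning aspects and all values are assumptions.
TUNING_CONF = {
    # data serialization: Kryo instead of default Java serialization
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # data compression: compress serialized RDD partitions
    "spark.rdd.compress": "true",
    # memory per executor (placeholder value)
    "spark.executor.memory": "4g",
    # share of executor heap for cached RDDs (legacy, pre-1.6 setting)
    "spark.storage.memoryFraction": "0.5",
    # share of executor heap for shuffle buffers (legacy, pre-1.6 setting)
    "spark.shuffle.memoryFraction": "0.3",
}


def spark_submit_args(conf):
    """Render a settings dict as a flat list of spark-submit arguments."""
    args = []
    for key in sorted(conf):  # sorted for a reproducible command line
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

Keeping the settings in one dictionary makes it straightforward to rerun the same clustering job under different memory and shuffle ratios and compare run times.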
As a result, the parallel computing efficiency of the improved Canopy_K-means (CKM) algorithm and its applicability in a distributed computing environment are further improved. Comparative experiments with the improved CKM and K-means on a Spark cluster lead to the following conclusions: (1) Spark is markedly more efficient (convergence rate, clustering accuracy) than Hadoop for iterative computation; (2) on Spark, the parallel CKM algorithm produces more accurate and reliable clustering results than the parallel K-means algorithm, and it converges faster; (3) on Spark, the speed-up ratio of the improved parallel CKM algorithm grows faster than that of the parallel K-means algorithm, and its expansion ratio converges more quickly to a stable value. Overall, the improved parallel CKM clustering algorithm is more efficient (accuracy, convergence rate, parallel performance) than the traditional K-means algorithm on Spark.
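The Canopy pre-clustering idea described in the abstract can be sketched on a single machine as follows. This is a minimal sketch of the textbook Canopy procedure feeding plain K-means, not the thesis's Spark implementation: the function names, the thresholds `t1` and `t2` (loose and tight distance, with `t1 > t2`), and the use of canopy centroids as initial centers are all assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def canopy_centers(points, t1, t2):
    """Return one centroid per canopy; their count serves as K."""
    if t1 <= t2:
        raise ValueError("t1 (loose threshold) must exceed t2 (tight threshold)")
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)      # next unassigned point seeds a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = euclidean(center, p)
            if d < t1:                 # within loose threshold: canopy member
                members.append(p)
            if d >= t2:                # outside tight threshold: may seed later
                still_remaining.append(p)
        remaining = still_remaining
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centers


def kmeans(points, centers, max_iter=20):
    """Lloyd's iterations starting from the supplied initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:     # assignments stable: converged
            break
        centers = new_centers
    return centers
```

Calling `kmeans(points, canopy_centers(points, t1, t2))` removes both the manual choice of K and the random choice of initial centers. The thresholds still have to be chosen to suit the data.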
Keywords/Search Tags:K-means, Distributed computing, Parallelization, Spark, Clustering