The Parallelization And Optimization Of K-means Algorithm Based On Spark

Posted on:2016-09-25

Degree:Master

Type:Thesis

Country:China

Candidate:B Zhang

Full Text:PDF

GTID:2348330479454697

Subject:Computer technology

Abstract/Summary:

The mobile Internet wave has generated massive amounts of data. These data hold great business value and practical significance, and how to mine useful information from such large, disordered data sets has become an important research topic. To improve massive-data processing capability, a cluster that pools computing resources can be used to carry out data mining tasks, and a parallelized, improved mining algorithm combined with a distributed computing platform can handle these tasks effectively. This research studies the parallel implementation of K-means on Spark. On the one hand, because standard K-means requires the K value to be chosen in advance and selects its initial cluster centers at random, this research proposes using the Canopy algorithm as a pre-clustering step that determines both the K value and the initial cluster centers, which improves the convergence speed and the stability of the clustering results. On the other hand, to make full use of Spark's RDD features, this research tunes Spark in terms of memory optimization, data compression, data serialization, the executor memory ratio, the executor shuffle ratio, cache size, and so on. As a result, the parallel computing efficiency and the applicability of the improved Canopy_K-means (CKM) algorithm in a distributed computing environment are further improved.
Comparative experiments with the improved CKM and K-means in a Spark cluster environment lead to the following conclusions: (1) Spark is far more efficient than Hadoop for iterative computation, in terms of convergence rate and clustering accuracy; (2) on Spark, the clustering results of the CKM parallel algorithm are more accurate and reliable than those of the K-means parallel algorithm, and CKM converges faster; (3) on Spark, the speed-up ratio of the improved CKM parallel algorithm increases faster than that of the K-means parallel algorithm, and its expansion ratio converges to a stable value more quickly. Overall, the improved CKM parallel clustering algorithm is more efficient (in accuracy, convergence rate, and parallel performance) than the traditional K-means algorithm on Spark.
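
The Canopy pre-clustering idea described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not the thesis's Spark implementation: the function names (`canopy_centers`, `kmeans`), the thresholds `t1` (loose) and `t2` (tight), and the choice of the canopy mean as the initial center are all assumptions made for illustration. The number of canopies found supplies K, and the canopy centers replace the random initial centers of K-means.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def canopy_centers(points, t1, t2):
    """Canopy pre-clustering (illustrative; t1 and t2 are assumed thresholds).

    t1 is the loose threshold (canopy membership), t2 the tight one
    (points this close to a canopy seed cannot seed another canopy).
    Returns one center per canopy; len(result) serves as K.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        # Every remaining point within t1 of the seed joins this canopy.
        members = [seed] + [p for p in remaining if dist(seed, p) < t1]
        centers.append(mean(members))
        # Points within t2 of the seed are removed from further consideration.
        remaining = [p for p in remaining if dist(seed, p) >= t2]
    return centers


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Lloyd's K-means iterations starting from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

A typical call would be `kmeans(points, canopy_centers(points, t1, t2))`. In a distributed setting, the assignment step would presumably run in parallel over data partitions, with the per-cluster sums combined to update the centers; the sketch above leaves that out.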

Keywords/Search Tags:

K-means, Distributed computing, Parallelization, Spark, Clustering

Related items

1	The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform
2	Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark
3	Research On Cloud Computing Search Engine Design And Parallelization K-means Clustering Algorithms For Big Data
4	Research And Realization Of Clustering Algorithm Based On Spark Platform
5	Optimized Design And Implementation Of K-means Algorithm Based On Big Data Spark Platform
6	Research On Parallelization Of K-means Algorithm Based On Spark Platform
7	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
8	The Key Research Of Clustering Algorithm Parallelization On The Platform Of Cloud Computing
9	Design And Implementation Of Data Stream Clustering Algorithm StreamCKS Based On Spark Streaming
10	Research And Application Of FCM Algorithms Based On Spark