Optimized Design And Implementation Of K-means Algorithm Based On Big Data Spark Platform

Posted on:2018-05-10

Degree:Master

Type:Thesis

Country:China

Candidate:R Zhang

Full Text:PDF

GTID:2438330602459299

Subject:Computer application technology

Abstract/Summary:

With the development of technology and social production, vast amounts of data are generated in daily life. This data holds enormous economic and social value, and how that value is tapped and used to promote economic development and social progress is a significant question. To improve the ability of clustering algorithms to handle large data sets, many big data processing platforms are now available that substantially increase the efficiency of data processing, and Spark is one of the most powerful. In this thesis, the k-means algorithm is studied and an improved version is parallelized on the Spark platform. The traditional k-means algorithm gives unstable clustering results because its initial cluster centers are highly uncertain, and an improved algorithm is proposed to solve this problem. The UPGMA (unweighted pair-group method with arithmetic mean) algorithm is used to merge data points that lie close together into new clusters. When the number of data points in a cluster reaches a set threshold, the cluster is added to the queue and deleted in the queue. The resulting clusters, together with the max-min distance algorithm, are then used to obtain center points that reflect the overall data distribution, which improves the stability and accuracy of clustering. At the same time, to take full advantage of Spark, the implementation can be tuned through memory optimization, data compression, cluster configuration and similar measures, so that the improved algorithm runs more effectively on the Spark platform. This thesis makes two main contributions. (1) The improved k-means algorithm is superior to the traditional k-means algorithm in accuracy, recall and F-measure. (2) The Spark-based improved k-means algorithm converges faster. It can be concluded that the improved k-means algorithm based on Spark is superior to the traditional k-means algorithm.
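The abstract does not spell out how the max-min distance step works, so the following is only a minimal sketch of the classic max-min (farthest-first) rule for choosing initial centers, not the thesis's implementation. It runs on raw points in plain Python, whereas the thesis applies the rule after UPGMA merging and on Spark. The function name `max_min_centers` and the choice of the first point as the seed are assumptions made for illustration.

```python
import math


def max_min_centers(points, k):
    """Pick k initial centers with the max-min distance (farthest-first) rule.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: seed with the first point; other seeding rules are possible.
    centers = [points[0]]

    # min_dist[i] is the distance from points[i] to its nearest chosen center.
    min_dist = [math.dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point whose nearest center is farthest away.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])

        # Refresh nearest-center distances against the newly added center.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))

    return centers


# Usage: three well-separated groups in the plane.
sample = [(0, 0), (1, 0), (10, 0), (10, 1), (0, 10)]
initial_centers = max_min_centers(sample, 3)
```

Because each new center is chosen to be far from all centers already selected, the initial centers tend to spread across the data rather than landing in one dense region, which is the source of the stability the abstract refers to.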

Keywords/Search Tags:

Big data, K-means, Clustering algorithm, Spark, Parallelization

Related items

1	The Parallelization And Optimization Of K-means Algorithm Based On Spark
2	Optimized Design And Implementation Of K-means Algorithm Based On Big Data Spark Platform
3	Research On Cloud Computing Search Engine Design And Parallelization K-means Clustering Algorithms For Big Data
4	Research And Application Of Big Data Clustering Algorithm Based On Spark Platform
5	Research On Parallelization Of K-means Algorithm Based On Spark Plat Form
6	Research On K-medoids Clustering Algorithm Based On Spark
7	Research On Parallelization Of Clustering Algorithm Based On MapReduce
8	Research And Application Of K-means Algorithm Based On Spark
9	Research On Parallelization Of Data Stream Clustering Algorithm For Police Data
10	Optimization And Application Of K-means Clustering Algorithm Based On Spark Framework