Optimized Design And Implementation Of K-means Algorithm Based On Big Data Spark Platform

Posted on: 2018-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: R Zhang
Full Text: PDF
GTID: 2438330602459299
Subject: Computer application technology
Abstract/Summary:
With technological development and social production, vast amounts of data are generated in daily life. This data holds enormous economic and social value, and tapping that value to promote economic development and social progress is significant. To improve the capacity of clustering algorithms to process large data sets, many big data processing platforms are now available that multiply the efficiency of data processing, and Spark is one of the most powerful. In this paper, the k-means algorithm is studied and an improved version is parallelized on the Spark platform. The traditional k-means algorithm gives unstable clustering results because the choice of its initial cluster centers is highly uncertain, and an improved algorithm is proposed to solve this problem. The UPGMA (unweighted pair-group method with arithmetic means) algorithm is used to merge data points that lie close together into new clusters. When the number of data points in a cluster reaches a set threshold, the cluster is added to a queue and deleted from the data still being merged. The queued clusters, together with the maximum and minimum distance algorithm, are then used to obtain center points that reflect the overall data distribution, which improves the stability and accuracy of clustering. At the same time, to take full advantage of Spark, the implementation can be tuned through memory optimization, data compression, cluster setup and similar measures, improving how well the improved algorithm runs on the Spark platform. There are two main contributions in this paper. (1) The improved k-means algorithm is superior to the traditional k-means algorithm in accuracy rate, recall rate and F-measure value. (2) The Spark-based improved k-means algorithm converges faster. It can be concluded that the improved k-means algorithm based on Spark is superior to the traditional k-means algorithm.
Keywords/Search Tags:Big data, K-means, Clustering algorithm, Spark, Parallelization
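
The abstract names the maximum and minimum distance algorithm as the step that picks initial centers but gives no details. The following is a minimal single-machine sketch of that general technique only, not the thesis's implementation: it assumes Euclidean distance, assumes the first candidate is taken as the first center, and omits the UPGMA merging, the threshold queue and the Spark parallelization. The names `euclidean` and `max_min_centers` are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def max_min_centers(points, k):
    """Pick k initial centers with the max-min distance rule.

    Assumption: the first point is used as the first center. Each further
    center is the point whose distance to its nearest already-chosen
    center is largest, so the centers spread across the data.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centers = [points[0]]
    # nearest[i] = distance from points[i] to its closest chosen center
    nearest = [euclidean(p, centers[0]) for p in points]

    while len(centers) < k:
        # The point farthest from every chosen center becomes the next center.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # Refresh the nearest-center distances against the new center.
        nearest = [
            min(d, euclidean(p, points[idx])) for d, p in zip(nearest, points)
        ]
    return centers
```

In a pipeline like the one the abstract describes, the candidates passed in would presumably be the centers of the queued clusters rather than raw data points, and the returned centers would seed an ordinary k-means run.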