
Research And Application Of K-means Algorithm Based On Spark

Posted on: 2022-10-11; Degree: Master; Type: Thesis
Country: China; Candidate: D L Wang; Full Text: PDF
GTID: 2518306539961589; Subject: Control Engineering
Abstract/Summary:
In today's information age, the data generated in every field of society is growing explosively, and how to mine potentially valuable information from massive, complex data has become a very active research topic. As a common clustering algorithm in data mining, K-means has a simple principle, high efficiency and an accurate clustering effect. However, its iteration is slow on large-scale data, and the choice of initial cluster centers strongly affects the clustering results. Moreover, K-means running on a single machine cannot meet the growing computing demands of massive data. To address these problems, this thesis proposes an improved K-means algorithm and combines it with the Spark distributed platform to parallelize the algorithm, so as to improve its performance on massive data. The main work of this thesis is as follows:

(1) The theoretical background is prepared: the HDFS distributed storage system of the Hadoop platform, the YARN resource manager and the Spark computing framework. The principle and shortcomings of the K-means clustering algorithm are also studied.

(2) To address the limitations of traditional K-means on big data, the algorithm is combined with stochastic gradient descent: the Adam algorithm is used to determine the gradient update direction adaptively, and an exponentially decaying learning rate controls how the learning rate changes, so that stochastic K-means converges better. The selection of the initial centers and the running efficiency are also improved. A parallel scheme for the improved algorithm is then designed around the characteristics of the Spark computing framework and RDDs, and running efficiency is improved by implementing the algorithm in parallel on the Spark distributed platform.

(3) A Spark cluster is built as the experimental platform. On the one hand, the performance of the improved algorithm is evaluated; the experimental results show that the improved K-means algorithm significantly improves clustering accuracy and robustness compared with the traditional K-means algorithm and the K-means++ algorithm. On the other hand, speedup and scale-up experiments are carried out; the results show that the algorithm achieves a better speedup ratio and scale-up ratio, higher practicability and good parallel efficiency when processing large-scale data sets on a Spark cluster.

(4) A telecom user analysis system is built on a B/S architecture and the Spring family of frameworks. The improved algorithm is deployed on the Spark distributed computing platform and used for user segmentation in the telecom user analysis system. Based on the results, the consumption behavior and characteristics of the different user groups are analyzed, and different marketing schemes are formulated.

In summary, an improved K-means algorithm is proposed to improve the clustering effect and running efficiency of K-means in data mining, and its performance is verified by experiments. The experimental results show that the parallel, Spark-based improved K-means algorithm has a good clustering effect and running efficiency. Its application to user segmentation in the telecom user analysis system verifies the effectiveness and application value of the improved algorithm.
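The idea in item (2) can be illustrated with a small single-machine sketch. It is not the thesis's implementation: the abstract gives no formulas or parameter values, so the function name `adam_kmeans`, the mini-batch gradient (the mean of `center - point` over the points assigned to each center), the hyperparameters, and the caller-supplied initial centers are all assumptions made for illustration. The abstract does not say how the initial centers are chosen, and no Spark calls are shown here; in a Spark version the per-center gradient sums and counts would presumably be computed per partition and then aggregated.

```python
import math
import random


def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
    )


def adam_kmeans(points, init_centers, steps=400, batch_size=16,
                lr0=0.3, decay_rate=0.99, beta1=0.9, beta2=0.999,
                eps=1e-8, seed=0):
    """Mini-batch K-means whose center updates use Adam with an
    exponentially decaying learning rate (all defaults are assumed values)."""
    rng = random.Random(seed)
    k, dim = len(init_centers), len(init_centers[0])
    centers = [list(c) for c in init_centers]
    m = [[0.0] * dim for _ in range(k)]  # Adam first-moment estimates
    v = [[0.0] * dim for _ in range(k)]  # Adam second-moment estimates

    for t in range(1, steps + 1):
        batch = rng.sample(points, batch_size)
        lr = lr0 * decay_rate ** t  # exponential learning-rate decay

        # Accumulate the gradient of the K-means loss for each center
        # over the batch points assigned to it.
        grad = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for x in batch:
            j = nearest(x, centers)
            counts[j] += 1
            for d in range(dim):
                grad[j][d] += centers[j][d] - x[d]

        for j in range(k):
            if counts[j] == 0:
                continue  # no batch point assigned: leave this center alone
            for d in range(dim):
                g = grad[j][d] / counts[j]
                m[j][d] = beta1 * m[j][d] + (1 - beta1) * g
                v[j][d] = beta2 * v[j][d] + (1 - beta2) * g * g
                # Bias correction uses the global step t (a simplification).
                m_hat = m[j][d] / (1 - beta1 ** t)
                v_hat = v[j][d] / (1 - beta2 ** t)
                centers[j][d] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return centers
```

Called with a list of 2-D points and two starting centers, for example `adam_kmeans(points, [[2.0, 2.0], [7.0, 7.0]])`, the function returns the updated centers as a list of coordinate lists.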
Keywords/Search Tags: Spark, K-means algorithm, parallelization, user analysis