Research On Optimization And Parallel Of K-means Algorithm On Spark

Posted on: 2021-10-30 | Degree: Master | Type: Thesis
Country: China | Candidate: Y N Xing | Full Text: PDF
GTID: 2518306128974429 | Subject: Software engineering
Abstract/Summary:
In the big data era, data mining plays a pivotal role across industries. K-means, a classical machine learning clustering algorithm, is applied ever more widely, and its shortcomings become more apparent on large-scale data. K-means requires the initial center points and the number of clusters to be fixed in advance, and poor choices reduce clustering accuracy and lengthen running time. The traditional K-means algorithm can no longer meet industrial requirements such as real-time response and accuracy, so optimizing and accelerating K-means for big data has become a focus of current research. To address these problems, this study adopts the Spark in-memory computing framework and improves the performance of K-means in two ways: improving how the algorithm selects its initial values, and accelerating it through parallelization. The specific research contents and contributions are as follows:

(1) On large-scale data, K-means is sensitive to noisy data, which leads to poor clustering results. To address this, the MD-K-means algorithm, based on the max-min distance calculation and on density, is proposed. The improvement covers both algorithm optimization and parallelization. First, the number of clusters K is determined by the max-min distance algorithm, while the initial centers are determined from the density of the data set, which optimizes K-means. The parallel clustering algorithm is then implemented on Spark and a comparison experiment is designed. Experimental analysis shows that the improved K-means algorithm performs better in both clustering quality and performance.

(2) To address the loss of accuracy caused by randomly selecting the number of clusters K and the initial values, an improved K-means strategy based on an improved Canopy algorithm is proposed. Because estimated Canopy distance thresholds can lead to poor clustering results, the distance threshold is determined from the data set, and the improved Canopy algorithm is then used to determine the initial values of K-means. Comparative experiments show that the improved K-means algorithm achieves clear improvements in clustering time, accuracy, and sum of squared errors.

(3) To address the low efficiency of K-means on large data sets, a parallel acceleration strategy for the CK-means algorithm on the Spark platform is proposed. The initial values of K-means are determined by the Canopy algorithm together with a mean calculation method, which reduces the number of iterations and improves efficiency. The Spark platform is analyzed, and K-means is parallelized in YARN mode. Experimental results show that the speedup (acceleration ratio) of the CK-means algorithm is significantly higher than that of the traditional algorithm.
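
The abstract does not give the details of MD-K-means or CK-means, so the following is only a minimal single-machine sketch of the two textbook seeding ideas it names: max-min distance selection (which also yields a value for K) and a Canopy pass whose canopy means serve as initial centers. The function names, the ratio `theta`, and the thresholds `t1` and `t2` are assumptions for illustration, not the thesis's actual method or parameters; the density step and the Spark parallelization are not shown.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def maxmin_centers(points, theta=0.5):
    """Textbook max-min distance seeding; theta is an assumed tuning ratio."""
    centers = [points[0]]                       # arbitrary first center
    second = max(points, key=lambda p: dist(p, centers[0]))
    base = dist(second, centers[0])             # reference distance
    if base == 0:                               # all points coincide
        return centers
    centers.append(second)
    while True:
        # Candidate: the point whose nearest center is farthest away.
        cand = max(points, key=lambda p: min(dist(p, c) for c in centers))
        gap = min(dist(cand, c) for c in centers)
        if gap <= theta * base:                 # nothing is far enough: stop
            return centers                      # len(centers) is the estimate of K
        centers.append(cand)

def canopy_centers(points, t1, t2):
    """One common Canopy formulation (t1 > t2 > 0); returns each canopy's mean."""
    assert t1 > t2 > 0
    remaining = list(points)
    means = []
    while remaining:
        seed = remaining[0]
        # Loose threshold t1 decides canopy membership.
        members = [p for p in remaining if dist(p, seed) <= t1]
        # Points within the tight threshold t2 cannot seed another canopy.
        remaining = [p for p in remaining if dist(p, seed) > t2]
        dim = len(seed)
        means.append(tuple(sum(p[i] for p in members) / len(members)
                           for i in range(dim)))
    return means

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(maxmin_centers(data))
    print(canopy_centers(data, t1=3.0, t2=2.0))
```

Either list of centers could then be passed to a K-means implementation as its starting points in place of random initialization, which is the general idea behind reducing the number of iterations.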
Keywords/Search Tags:K-means algorithm, Algorithm optimization, Spark memory computing framework, Clustering algorithm, Initial center point