Research And Application Of K-means++ Algorithm Based On Spark Platform

Posted on: 2020-11-30  Degree: Master  Type: Thesis
Country: China  Candidate: J Y Du  Full Text: PDF
GTID: 2428330578455272  Subject: Computer Science and Technology
Abstract/Summary:
Labeling the massive volumes of data generated in the era of big data is prohibitively expensive. Cluster analysis, as an unsupervised learning method, can mine massive unlabeled data and discover the patterns it contains. The K-Means algorithm is a representative clustering algorithm with the advantages of simplicity, speed, and good scalability. However, the number of clusters K must be specified manually, and performance depends on the choice of the initial cluster centers. To address these shortcomings, this thesis proposes an improved K-Means algorithm that raises clustering quality and runs it on the Spark cloud computing framework to improve its parallel computing performance. The result is the Spark-based Parallel Improved K-Means (SPIK-Means) algorithm. The main research and improvement work is as follows.

Firstly, because the K value is chosen subjectively and the initial cluster centers are selected at random, the clustering results of K-Means are unstable and easily fall into a local minimum. An improved K-Means algorithm is proposed to improve the efficiency of the algorithm. The K-Means++ algorithm is used to determine K suitable initial cluster centers, and the morphological similarity distance (MSD) is used as the similarity measure. The algorithm is thus improved in three respects: the choice of K, the initial cluster centers, and the similarity measure. Simulation experiments on UCI standard datasets show that the improved K-Means algorithm outperforms both the traditional K-Means algorithm and the Spark-based Kd-Tree K-Means (SKDK-Means) algorithm in running time and accuracy, improving the running speed as well as the quality of the classification.

Secondly, computing distances in the improved K-Means algorithm is time-consuming, and the amount of computation grows with the number of sample points, so the running time becomes too long. Apache Spark is a distributed framework for big data cluster computing. To solve the problem that the improved K-Means algorithm runs slowly on large datasets, the improved algorithm is combined with the Spark cloud computing framework to form the SPIK-Means algorithm. A comparison of running times across different numbers of nodes and different datasets shows that SPIK-Means maintains good parallel computing performance in a cluster environment and effectively improves execution efficiency.

Thirdly, remote sensing image data are classified with the proposed SPIK-Means algorithm. The K value is determined by comparing the simplified silhouette index across candidate K values, and the user's accuracy and producer's accuracy are then calculated from the confusion matrix for analysis. SPIK-Means is compared with the traditional K-Means algorithm in a remote sensing image classification experiment in a parallel environment, and the overall accuracy and Kappa coefficient are summarized. The experimental results show that SPIK-Means classifies remote sensing images more accurately than K-Means.
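
For readers unfamiliar with the K-Means++ seeding step named in the abstract, the following is a minimal single-machine sketch in Python, not the thesis's implementation. The function names (`kmeans_pp_seeds`, `squared_euclidean`) are hypothetical, and squared Euclidean distance stands in for the MSD measure, whose formula the abstract does not give. The Spark parallelization is not shown.

```python
import random


def squared_euclidean(p, q):
    """Squared Euclidean distance (a stand-in; the thesis uses MSD)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, dist=squared_euclidean, rng=None):
    """Pick k initial centers by K-Means++ (cost-weighted) sampling.

    `dist` must return a non-negative cost; any such measure can be
    plugged in here (hypothetical hook, assumed for illustration).
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = rng or random.Random()

    # The first center is drawn uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    # costs[i] is the cost from points[i] to its nearest chosen center.
    costs = [dist(p, centers[0]) for p in points]

    while len(centers) < k:
        total = sum(costs)
        if total == 0:
            # Every point coincides with a center; fall back to uniform.
            nxt = points[rng.randrange(len(points))]
        else:
            # Draw a point with probability proportional to its cost.
            r = rng.random() * total
            acc = 0.0
            nxt = None
            for p, c in zip(points, costs):
                if c > 0:
                    nxt = p  # last positive-cost point doubles as fallback
                    acc += c
                    if acc >= r:
                        break
        centers.append(nxt)
        # Update each point's cost against the newly added center.
        costs = [min(c, dist(p, nxt)) for p, c in zip(points, costs)]
    return centers
```

Passing a seeded generator, for example `kmeans_pp_seeds(points, 3, rng=random.Random(0))`, makes the seeding reproducible. Because points far from the existing centers are more likely to be drawn, the seeds tend to spread across the data, which is the property that reduces the sensitivity to initialization described above.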
Keywords/Search Tags:SPIK-Means, K-Means algorithm, Apache Spark, simplified silhouette index, remote sensing image