Research And Application Of K-means Clustering Algorithm Based On Distributed Computing Platform

Posted on: 2019-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Liu
GTID: 2428330590465786
Subject: Computer technology
Abstract/Summary:
With the wide adoption of the mobile Internet and the Internet of Things, data volumes have been growing rapidly across industries, and extracting useful information from massive data has become a hot topic in big data research. As a simple but efficient clustering algorithm in data mining, K-Means is widely used in data analysis and processing. Its main challenges are selecting the initial center points, avoiding convergence to local optima, and reducing the number of iterations, all of which affect the quality and efficiency of clustering. Optimizing K-Means for current distributed computing environments is therefore of great significance for real-time analysis. Building on distributed parallel computing frameworks, this thesis studies the K-Means algorithm in depth and optimizes its parallel implementation. The main contributions are as follows:

1. The native K-Means algorithm often suffers from local optima, unstable clustering results, and an excessive number of iterations. To address these problems, a random maximum-minimum distance K-Means optimization algorithm is proposed and implemented on Hadoop with MapReduce. Experiments comparing it with native K-Means and with maximum-minimum distance K-Means show that the optimized algorithm reduces the number of iterations needed for initial point selection, improves clustering accuracy, and achieves low latency.

2. The method in contribution 1 still has limitations when searching for the initial center points, so a random-sampling cluster center selection method based on the Spark framework is further proposed. Experiments comparing the native and optimized K-Means algorithms on Hadoop and Spark show that the optimized algorithm on Spark has better scalability and lower latency.

In summary, the optimized K-Means algorithm and its parallel implementation on a distributed computing platform effectively improve accuracy, recall, and execution efficiency, and achieve a better speedup ratio and scalability, making the approach more suitable for cluster analysis in big data environments.
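The abstract does not spell out the random maximum-minimum distance procedure, so the following is only a minimal single-machine sketch of the general idea behind maximum-minimum distance seeding: the first center is drawn at random, and each further center is the point farthest from its nearest already-chosen center. The function name `max_min_seeds`, its parameters, and the sample data are assumptions made for illustration; this is not the thesis's MapReduce or Spark implementation.

```python
import math
import random


def max_min_seeds(points, k, seed=0):
    """Pick k initial centers by max-min distance seeding (hypothetical helper).

    The first center is chosen at random; every later center is the point
    whose distance to its nearest already-chosen center is largest.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    chosen = [first]
    # nearest[i] holds the distance from points[i] to its closest chosen center.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(chosen) < k:
        # Next center: the point with the largest nearest-center distance.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        # Update each point's nearest-center distance with the new center.
        nearest = [min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return [points[i] for i in chosen]


# Illustrative data: three well-separated groups of three points each.
sample = [(0, 0), (0.1, 0), (0, 0.1),
          (10, 10), (10.1, 10), (10, 10.1),
          (20, 0), (20.1, 0), (20, 0.1)]
seeds = max_min_seeds(sample, k=3, seed=1)
```

On well-separated data such as `sample`, this kind of seeding tends to place one initial center in each group, which is one way to reduce the unstable results and extra iterations attributed above to poorly chosen starting points. In a distributed setting the distance update would presumably be split across workers, but that layout is not described in the abstract.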
Keywords/Search Tags:K-Means Clustering Algorithm, Random Maximum and Minimum Distance, Random Sampling, Distributed Computing Platform, Parallel Implementation