Research On Parallel Sampling K-Means Algorithm Based On MapReduce

Posted on: 2017-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: P Cui
Full Text: PDF
GTID: 2428330548983848
Subject: Computer technology
Abstract/Summary:
The K-means algorithm is widely used in business, academia, and other fields because it is simple, fast, and easy to implement. However, its results depend on the choice of initial cluster centers, its clustering accuracy can be poor, and it runs into storage problems when processing massive data sets. With the wide adoption of Hadoop, K-means has been parallelized, and the Canopy-kmeans algorithm built on that work better addresses both massive data storage and initial center selection. Because Canopy-kmeans preprocesses the entire data set, however, selecting the initial centers remains costly. To address these problems, this thesis proposes a parallel sampling K-means algorithm based on MapReduce. First, a K selection sort algorithm is combined with the MapReduce programming model to sample the data in parallel, which improves sampling efficiency. Second, a preprocessing strategy applied to the sample, rather than to the full data set, yields the initial centers quickly. Finally, the mean-based iteration is replaced with a weight substitution method, which improves clustering accuracy, and cluster optimization further improves the efficiency of the algorithm. Experimental results show that the parallel algorithm achieves better clustering results and speedup, and that its performance improves further in comparison experiments on the optimized cluster.
Keywords/Search Tags: K-means algorithm, K selection sort, MapReduce, cluster optimization
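The abstract does not give implementation details, so the following Python sketch only illustrates the general pattern it refers to, not the thesis's own method. It shows one standard MapReduce-style K-means iteration (map: assign each point to its nearest center; reduce: average the points per cluster) and one common way to draw a sample in parallel by keeping the k smallest random keys per split and merging them. All function names are hypothetical, the top-k sampling is an assumed stand-in for the "K selection sort" step, and the plain mean update is the baseline that the thesis replaces with weight substitution.

```python
import heapq
import random
from collections import defaultdict


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_map(split, centers):
    """Map phase: emit (cluster index, (point, 1)) for each point in one split."""
    for point in split:
        yield nearest(point, centers), (point, 1)


def kmeans_reduce(pairs, centers):
    """Reduce phase: sum points per cluster and return the new mean centers."""
    sums, counts = {}, defaultdict(int)
    for idx, (point, n) in pairs:
        if idx in sums:
            sums[idx] = [s + p for s, p in zip(sums[idx], point)]
        else:
            sums[idx] = list(point)
        counts[idx] += n
    # A cluster that received no points keeps its previous center.
    return [
        [s / counts[i] for s in sums[i]] if i in sums else list(centers[i])
        for i in range(len(centers))
    ]


def sample_split(split, k, rng):
    """Per-split sampling (assumed scheme): tag points with random keys, keep the k smallest."""
    return heapq.nsmallest(k, ((rng.random(), p) for p in split))


def merge_samples(partials, k):
    """Merge the per-split candidates and keep the k smallest keys overall."""
    merged = heapq.nsmallest(k, (kp for part in partials for kp in part))
    return [point for _, point in merged]
```

In a real Hadoop job each split would be handled by a separate mapper and the reduce step would run per cluster key; here the splits are ordinary Python lists so the logic can be checked locally. Sampling first and choosing initial centers from the sample is what lets the preprocessing avoid a pass over the full data set.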