The Internet contains information of great commercial value, and extracting useful information from this massive volume of data is an important problem. Data mining emerged in this context, and cluster analysis is one of its important parts: it groups objects according to the characteristics of the data, so that similar objects fall into the same cluster. Cluster analysis is widely used in market research, product recommendation, image processing, and data analysis. With the "information explosion", however, cluster analysis over massive data sets has become very slow and can no longer meet today's business needs, so parallelizing it is urgent.

Spark, a new in-memory distributed computing framework developed by the AMPLab at the University of California, Berkeley, is aimed mainly at massive data processing and machine learning. Compared with traditional parallel computing frameworks, its in-memory computation is well suited to iterative calculation. It also encapsulates data partitioning, parallel processing, and fault tolerance, which makes it well suited to developing parallel programs.

K-means is a widely used clustering algorithm that generally takes the sum-of-squared-errors criterion function as its clustering criterion; it processes data sets efficiently and produces good clustering results. On massive data, however, computing the distances between the enormous number of data objects becomes a bottleneck: as the data size and the number of iterations grow, the computation time becomes too long. In addition, the algorithm leaves the value of k undetermined and selects the initial cluster centers at random, both of which can affect the accuracy of the clustering results and the efficiency of the algorithm.

To break through the computing bottleneck on massive data, this article implements a parallel k-means algorithm on the Spark platform.
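The k-means iteration and the sum-of-squared-errors criterion described above can be illustrated with a minimal single-machine sketch. It is not the Spark implementation from this article: it uses plain Python, the function names (`squared_distance`, `closest_center`, `kmeans`, `sse`) are hypothetical, and the initial centers are passed in explicitly rather than chosen at random. The assignment step is independent for each point, which is the part a parallel framework can distribute.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_center(point, centers):
    """Index of the center nearest to `point` (ties go to the lower index)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, centers, max_iter=100):
    """Lloyd-style k-means starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point is handled independently of the others.
        sums, counts = {}, {}
        for p in points:
            i = closest_center(p, centers)
            if i in sums:
                sums[i] = [s + x for s, x in zip(sums[i], p)]
                counts[i] += 1
            else:
                sums[i] = list(p)
                counts[i] = 1
        # Update step: move each center to the mean of its assigned points.
        # A center with no assigned points is left where it is.
        new_centers = [
            tuple(s / counts[i] for s in sums[i]) if i in sums else centers[i]
            for i in range(len(centers))
        ]
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers


def sse(points, centers):
    """Sum of squared errors: squared distance of each point to its nearest center."""
    return sum(squared_distance(p, centers[closest_center(p, centers)])
               for p in points)


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
final_centers = kmeans(points, [(0, 0), (11, 11)])
print(final_centers)               # [(0.5, 0.5), (10.5, 10.5)]
print(sse(points, final_centers))  # 4.0
```

Every iteration recomputes the distance from every point to every center, which is why the running time grows with both the data size and the number of iterations.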
To address the shortcomings of the k-means algorithm, this article uses the canopy algorithm to optimize k-means, improving both the efficiency and the accuracy of the clustering results, and implements the resulting canopy-kmeans algorithm in parallel on the Spark platform. The Spark-based parallel k-means and parallel canopy-kmeans algorithms are compared in terms of accuracy, speedup, and scalability, and their performance is also compared with that on other platforms. The experimental results show that the parallelized algorithm gives better clustering results and achieves good speedup and scalability on massive data. Compared with the Hadoop platform, the parallel algorithms on the Spark platform are more efficient. When the Spark platform runs clustering tasks with different resource demands, scheduling tasks with the resource management platform YARN executes them more efficiently than scheduling with the resource management platform Mesos. The research shows that implementing the parallelization on the combined Spark+YARN platform is feasible and efficient, and has practical significance.
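As a rough illustration of how a canopy pass can supply both a value of k and non-random initial centers for k-means, the following single-machine sketch may help. It makes several assumptions that the text above does not specify: the thresholds `t1` (loose) and `t2` (tight), the choice of the first remaining point as each canopy center, and the use of each canopy's mean as an initial center. The function names are hypothetical, and no Spark API is used.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def canopies(points, t1, t2):
    """Group points into canopies using a loose threshold t1 and a tight threshold t2.

    Returns a list of (center, members) pairs. Assumption: the first
    remaining point is taken as the next canopy center.
    """
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        # Points within the loose threshold belong to this canopy.
        members = [p for p in remaining if euclidean(center, p) < t1]
        result.append((center, members))
        # Points within the tight threshold cannot start or join later canopies.
        remaining = [p for p in remaining if euclidean(center, p) >= t2]
    return result


def initial_centers(points, t1, t2):
    """Mean of each canopy, usable as initial k-means centers (k = number of canopies)."""
    centers = []
    for _, members in canopies(points, t1, t2):
        centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centers


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers = initial_centers(points, t1=5.0, t2=3.0)
print(len(centers), centers)  # 2 [(0.5, 0.5), (10.5, 10.5)]
```

The canopy pass needs only one distance computation per remaining point per canopy, so it is cheap compared with the k-means iterations it is meant to shorten. The quality of the result depends on how `t1` and `t2` are chosen for the data.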