With the continuous development of Internet technology, the explosive growth of data has ushered in the era of big data. Extracting value from data by means of data mining has attracted growing attention, and cluster analysis is an important branch of data mining. At present, most traditional clustering algorithms run serially on a stand-alone machine and cannot meet the demands of processing massive data, because a single computer is limited in memory, storage space, and computing power. The development of distributed computing technology, however, offers an opportunity to address this problem. This paper studies and improves the Canopy-Kmeans algorithm and the CFSFDP algorithm for cluster analysis by combining them with the Spark distributed computing framework. The main contributions are as follows: (1) The Canopy-Kmeans algorithm selects its initial centers randomly, and its clustering results are easily affected by its parameters. This paper addresses these two problems by using density peaks and the maximum-minimum principle, respectively. The improved algorithm also reduces the influence of noise points and is parallelized on the Spark framework. (2) The CFSFDP algorithm requires the cluster centers to be selected manually from a decision graph; this step is not only inaccurate but also prevents the algorithm from running automatically in parallel. In this paper, the cluster centers are found automatically by using the idea of slope to compute the demarcation point between cluster centers and non-center points. The improved algorithm thus determines the cluster centers by computation rather than manual inspection and is parallelized on the Spark framework. Experiments on a Spark-on-YARN cluster show that both the improved Canopy-Kmeans algorithm based on density peaks and the CFSFDP algorithm with automatic center selection achieve good clustering results and good parallel performance.
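To make the idea of automatic center selection in CFSFDP more concrete, the following is a minimal single-machine sketch in Python. It is not the paper's implementation: the abstract does not state the exact slope criterion, so the sketch assumes that the boundary between centers and non-center points lies at the steepest drop in the sorted values of gamma = rho * delta. The function names `decision_values` and `select_centers`, the cutoff-kernel density, and the index-based tie-breaking are all illustrative assumptions, and no Spark parallelization is shown.

```python
import math


def decision_values(points, dc):
    """Compute the two decision-graph quantities for every point.

    rho[i]   : local density, here the number of other points closer than dc
               (cutoff kernel; an assumption of this sketch).
    delta[i] : distance to the nearest point of higher density.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Ties in density are broken by index (assumption), so that only
        # one point is treated as the global density peak.
        higher = [dist[i][j] for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        # The global peak has no denser neighbour; by convention it gets
        # its largest distance to any other point.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def select_centers(rho, delta):
    """Pick centers at the steepest drop of the sorted gamma = rho * delta.

    Assumed stand-in for the paper's slope-based demarcation point.
    """
    gamma = [r * d for r, d in zip(rho, delta)]
    order = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    # Slope between consecutive sorted values (unit spacing on the x axis).
    drops = [gamma[order[k]] - gamma[order[k + 1]]
             for k in range(len(order) - 1)]
    cut = max(range(len(drops)), key=lambda k: drops[k])
    # Every point ranked before the steepest drop is taken as a center.
    return sorted(order[:cut + 1])
```

Called on two compact, well-separated groups of points with a cutoff `dc` slightly larger than the within-group spacing, `select_centers` should return one index per group. A single steepest-drop rule is a simplification: it can under-select centers when clusters differ strongly in density, which is one reason a more careful slope criterion may be needed in practice.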