Research And Realization Of Clustering Algorithm Based On Spark Platform

Posted on: 2019-01-30   Degree: Master   Type: Thesis
Country: China   Candidate: K Hu
GTID: 2428330545470723   Subject: Computer system architecture
Abstract/Summary:
The traditional k-means clustering algorithm randomly selects samples from the data to be clustered as its initial center points. Different choices of initial points can lead to different clustering results, which makes the output unstable. To make the clustering result more stable, it is important to study how to determine the number of center points before clustering and how to select appropriate initial center points. Mean Shift is a non-parametric density estimation algorithm. It iterates quickly toward a local maximum of the probability density function, so it can quickly locate cluster center points. Based on a Mean Shift algorithm improved for the Spark cluster environment, this thesis studies the disadvantages of k-means in depth. The main achievements are as follows. First, the theory and implementation process of the k-means algorithm are studied in depth, and the limitation of selecting initial points randomly is pointed out. Second, the theory of the Mean Shift algorithm is studied in depth, and Mean Shift is proposed as a way to optimize the k-means algorithm. Third, each new initial point is chosen as the point farthest from all previously chosen points, which makes the initial points more dispersed, avoids falling into a local optimum as can happen when the points are selected randomly, and improves the clustering result. Finally, using Mean Shift to improve the k-means algorithm makes it converge faster, so the modified algorithm clusters more quickly.
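The two ideas named in the abstract, farthest-point selection of initial centers and Mean Shift mode seeking, can be sketched as follows. This is a minimal single-machine Python sketch under stated assumptions, not the thesis's Spark implementation. "Farthest from all previously chosen points" is read here as maximizing the distance to the nearest already chosen center (the abstract does not give the exact criterion), the Mean Shift step uses a flat kernel with a fixed bandwidth, and the names farthest_first_centers, mean_shift_mode, bandwidth, and tol are hypothetical.

```python
import math
import random


def farthest_first_centers(points, k, seed=0):
    """Pick k initial centers so that each new center is the point whose
    distance to its nearest already chosen center is largest.
    (Assumed reading of 'farthest from all previously chosen points'.)"""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    first = points[rng.randrange(len(points))]
    centers = [first]
    # min_dist[i] is the distance from points[i] to its nearest chosen center.
    min_dist = [math.dist(p, first) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=min_dist.__getitem__)
        chosen = points[idx]
        centers.append(chosen)
        min_dist = [min(d, math.dist(p, chosen))
                    for p, d in zip(points, min_dist)]
    return centers


def mean_shift_mode(start, points, bandwidth, tol=1e-6, max_iter=100):
    """Move 'start' to the mean of all points within 'bandwidth' of it,
    repeating until the shift is smaller than 'tol' (flat kernel)."""
    current = tuple(start)
    for _ in range(max_iter):
        window = [p for p in points if math.dist(p, current) <= bandwidth]
        if not window:
            break  # no neighbours inside the window; nothing to shift toward
        shifted = tuple(sum(coord) / len(window) for coord in zip(*window))
        if math.dist(shifted, current) < tol:
            return shifted
        current = shifted
    return current


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0),
            (10, 10), (10, 11), (11, 10),
            (20, 0), (20, 1), (21, 0)]
    seeds = farthest_first_centers(data, k=3)
    # Refine each dispersed seed toward a local density maximum.
    refined = [mean_shift_mode(s, data, bandwidth=3.0) for s in seeds]
    print(seeds)
    print(refined)
```

The refined points could then be passed to k-means as its initial centers in place of random samples. In a Spark setting the distance computations would be distributed across partitions, which this sketch does not attempt.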
Keywords/Search Tags: k-means, distributed computing, Spark, Mean Shift