Research And Realization Of Clustering Algorithm Based On Spark Platform

Posted on:2019-01-30

Degree:Master

Type:Thesis

Country:China

Candidate:K Hu

Full Text:PDF

GTID:2428330545470723

Subject:Computer system architecture

Abstract/Summary:

The traditional k-means clustering algorithm randomly selects samples from the data set as initial center points. Different initial points can produce different clustering results, so the output is unstable from run to run. To make the result more stable, it is important to determine the number of center points before clustering begins and to select appropriate initial center points. Mean Shift is a non-parametric density estimation algorithm that iteratively moves toward a local maximum of the probability density function, which lets it locate cluster center points quickly. This thesis studies the shortcomings of k-means in depth and addresses them with an improved Mean Shift algorithm running in a Spark cluster environment. The main contributions are as follows. First, the theory and implementation of the k-means algorithm are studied in depth, and the limitation caused by random selection of initial points is identified. Second, the theory of the Mean Shift algorithm is studied in depth, and Mean Shift is proposed as a way to optimize k-means. Third, each new initial point is chosen as the point farthest from all previously chosen points, which spreads the initial points out, avoids the local optima that random selection can fall into, and improves the clustering result. Finally, using Mean Shift to improve k-means makes the algorithm converge faster, so the improved algorithm clusters faster.
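The two ideas in the abstract, seeking density peaks with Mean Shift and spreading the initial points out, can be illustrated with a minimal single-machine sketch. This is not the thesis implementation: the flat (uniform) kernel, the `bandwidth` parameter, the max-min reading of "farthest from all previously chosen points", and the function names `mean_shift_mode` and `farthest_first` are all assumptions made for illustration, and no Spark code is shown.

```python
import math


def mean_shift_mode(start, points, bandwidth, max_iter=100, tol=1e-6):
    """Shift `start` toward a local density peak.

    Assumption: a flat kernel, so each step moves to the plain mean of
    all points within `bandwidth` of the current position.
    """
    current = tuple(start)
    for _ in range(max_iter):
        window = [p for p in points if math.dist(p, current) <= bandwidth]
        if not window:
            break  # no neighbours in range; stay where we are
        shifted = tuple(sum(coord) / len(window) for coord in zip(*window))
        if math.dist(shifted, current) < tol:
            return shifted  # converged: the position stopped moving
        current = shifted
    return current


def farthest_first(candidates, k):
    """Pick k well-separated seeds from `candidates`.

    Assumption: "farthest from all previously chosen points" is read as
    maximising the distance to the nearest already-chosen seed.
    The first seed is simply the first candidate.
    """
    seeds = [candidates[0]]
    while len(seeds) < k:
        next_seed = max(
            candidates,
            key=lambda p: min(math.dist(p, s) for s in seeds),
        )
        seeds.append(next_seed)
    return seeds
```

One possible way to combine them is to run `mean_shift_mode` from a sample of the data, merge the modes that land close together to estimate the number of centers, and pass the surviving modes to `farthest_first` to obtain dispersed initial centers for k-means. `math.dist` requires Python 3.8 or later.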

Keywords/Search Tags:

k-means, distributed computing, Spark, Mean Shift

Related items

1	The Parallelization And Optimization Of K-means Algorithm Based On Spark
2	Research On Data Mining Technology Based On Spark
3	A System For Distributed MD Data Analysis Based On Spark
4	Research On Spark Oriented Fuzzy C-means Clustering Algorithm
5	Research And Application Of FCM Algorithms Based On Spark
6	Research On Optimization And Parallel Of K-means Algorithm On Spark
7	Research And Application Of K-means++ Algorithm Based On Spark Platform
8	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
9	Design And Implementation Of Data Stream Clustering Algorithm StreamCKS Based On Spark Streaming
10	Design And Implementation Of A Distributed Hybrid Index Structure Based On Spark