Research And Application Of K-means Clustering Algorithm Based On Distributed Computing Platform

Posted on:2019-01-26

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Liu

GTID:2428330590465786

Subject:Computer technology

Abstract/Summary:

With the wide application of the mobile Internet and the Internet of Things, the volume of data has been growing rapidly across industries. How to extract useful information from massive data has become a central topic in big data research. As a simple but efficient clustering algorithm in data mining, K-Means is widely used in data analysis and processing. Its main challenges are selecting the initial center points, avoiding convergence to local optima, and reducing the number of iterations, all of which affect the quality and efficiency of clustering. Optimizing K-Means for current distributed computing environments is therefore of great significance for real-time analysis. Based on distributed parallel computing frameworks, this thesis studies the K-Means algorithm in depth and optimizes its parallel implementation. The main contributions are as follows:

1. The native K-Means algorithm often suffers from local optima, unstable clustering results, and an excessive number of iterations during clustering. To address these problems, a random maximum-minimum distance K-Means optimization algorithm is proposed and implemented on Hadoop with MapReduce. In experiments against the native K-Means and the maximum-minimum distance K-Means, the optimized algorithm reduces the number of iterations needed for initial point selection, improves clustering accuracy, and achieves low latency at the same time.

2. Because the approach in contribution 1 still has limitations when searching for the initial center points, a cluster center selection method based on random sampling is further proposed on the Spark framework. In experiments against the native and optimized K-Means algorithms on Hadoop and Spark respectively, the optimized algorithm on Spark shows better scalability and lower latency.

In summary, the optimized K-Means algorithm and its parallel implementation on a distributed computing platform effectively improve the accuracy, recall, and execution efficiency of the algorithm, and achieve a better speedup ratio and scalability. The approach is therefore better suited to clustering analysis in big data environments.
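
To make the initial center selection discussed in the abstract more concrete, the following is a minimal single-machine sketch of classic maximum-minimum distance seeding in Python. It is not the thesis's algorithm: the abstract does not specify how the random variant or the Spark sampling step works, so the random choice of the first center, the function name `max_min_seeds`, and the `rng` parameter are all assumptions made for illustration, and no MapReduce or Spark parallelization is shown.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def max_min_seeds(points, k, rng=None):
    """Pick k initial centers with max-min distance seeding (illustrative only)."""
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: the first center is drawn uniformly at random.
    first = rng.randrange(len(points))
    centers = [points[first]]

    # nearest[i] holds the distance from points[i] to its closest chosen center.
    nearest = [dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point farthest from all centers chosen so far,
        # i.e. the one that maximizes the minimum distance.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        # Refresh each point's nearest-center distance against the new center.
        nearest = [min(d, dist(p, points[idx])) for p, d in zip(points, nearest)]

    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
            (20, 0), (20, 1), (21, 0)]
    print(max_min_seeds(data, 3, random.Random(0)))
```

On well-separated data such as the sample above, each selected center falls in a different group, which is the property that tends to reduce the number of K-Means iterations compared with purely random initialization. In a distributed setting the distance refresh step is the part that would be spread across workers.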

Keywords/Search Tags:

K-Means Clustering Algorithm, Random Maximum and Minimum Distance, Random Sampling, Distributed Computing Platform, Parallel Implementation

Related items

1	Research On The Parallel Clustering Algorithm Based On MapReduce
2	Research On Text Clustering And Its Application In Topic Detection Analysis
3	Application Research Of Improved K-means Algorithm In Big Data Clustering
4	Parallel Clustering Algorithm Based On MapReduce
5	Research On K-Means Algorithm Based On MapReduce
6	Design And Implementation Of Trade And Industry Bureau Subject Supervision System Based On Double Random Sampling
7	Distributed SVM Algorithm With K-means
8	Research And Application Of Random Walk Algorithm Based On Distance
9	Parallel Research And Application Of Machine Learning Algorithm Based On Cloud Platform
10	The Research On Parallel Computing Technology In Precise Agricultural Climate Division