Research And Improvement Of Big Data Parallel Clustering Algorithm Based On Spark

Posted on:2019-04-21

Degree:Master

Type:Thesis

Country:China

Candidate:Q Li

Full Text:PDF

GTID:2428330566973375

Subject:Information and Communication Engineering

Abstract/Summary:

With the continuous development of Internet technology, the explosive growth of data has ushered in the era of big data. Extracting value from data through data mining has attracted increasing attention, and clustering analysis is an important branch of data mining. At present, most traditional clustering algorithms run serially on a stand-alone machine and cannot meet the needs of processing massive data because of a single computer's limits on memory, storage space, and computing power. The development of distributed computing technology offers an opportunity to address this problem. This thesis studies and improves the Canopy-Kmeans algorithm and the CFSFDP algorithm for clustering analysis by combining them with the Spark distributed computing framework. The main contributions are as follows:

(1) The Canopy-Kmeans algorithm selects its initial centers randomly, and its clustering results are easily affected by the choice of parameter. This thesis addresses these two issues by using density peaks and the max-min principle, respectively. It also reduces the influence of noise points on the algorithm and parallelizes the algorithm on the Spark framework.

(2) The CFSFDP algorithm requires cluster centers to be selected manually from the decision graph. This step is not only inaccurate but also prevents automatic parallel computation. In this thesis, cluster centers are found automatically by using the idea of slope to compute the demarcation point between cluster centers and non-center points. The improved algorithm can therefore determine cluster centers automatically through calculation, and it is parallelized on the Spark framework.

Experiments on a Spark on Yarn computer cluster show that the improved Canopy-Kmeans algorithm based on density peaks and the CFSFDP algorithm with automatic center selection both achieve good clustering results and parallel performance.
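
The automatic center selection described in (2) can be illustrated with a small, single-machine sketch in plain Python (no Spark). The abstract does not give the thesis's exact slope rule, so everything here is an assumption made for illustration: the function name `density_peak_centers`, the cutoff distance `d_c`, the decision value `gamma = rho * delta`, and the rule of cutting at the steepest drop between consecutive sorted `gamma` values. It is not the thesis's implementation.

```python
import math


def density_peak_centers(points, d_c):
    """Pick cluster centers without a manually read decision graph.

    rho[i]   : number of other points closer than the cutoff d_c (local density)
    delta[i] : distance to the nearest point that ranks higher in density
    gamma[i] : rho[i] * delta[i], used here as the decision value (assumption)
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Rank points by decreasing density; ties are broken by index so that
    # every point has a well-defined set of "denser" points.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: by convention, use its largest distance.
            delta[i] = max(dist[i])
        else:
            delta[i] = min(dist[i][j] for j in order[:rank])

    gamma = [rho[i] * delta[i] for i in range(n)]
    ranked = sorted(range(n), key=lambda i: -gamma[i])

    # Assumed slope rule: the demarcation point is the steepest drop
    # between consecutive decision values in the sorted sequence.
    drops = [gamma[ranked[k]] - gamma[ranked[k + 1]] for k in range(n - 1)]
    cut = max(range(n - 1), key=lambda k: drops[k])
    return sorted(ranked[:cut + 1])


# Hypothetical sample data: two small, well-separated groups of points.
points = [
    (0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1),
    (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9),
]
centers = density_peak_centers(points, d_c=0.15)
print(centers)
```

In a distributed setting the pairwise distance and density steps are the natural candidates for parallel execution, while the final sort-and-cut step operates on one value per point. How the thesis actually partitions this work on Spark is not stated in the abstract.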

Keywords/Search Tags:

Spark, clustering, Canopy-Kmeans, CFSFDP, parallel

Related items

1	The Research Of Parallel Clustering Algorithm Based On Hadoop Platform
2	Research And Implementation Of A Hybrid Recommendation System Based On Auto Encoder And Canopy-Kmeans Algorithm
3	The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform
4	Research And Application Of Expert Information Knowledge Graph Based On Improved Canopy-Kmeans
5	Design And Realization Of Face Image Retrieval System Based On Spark Framework
6	Research On Clustering Algorithm On Hadoop Platform
7	Parallelizing K-means-based Clustering On Spark
8	Research On Fast Search Density Peak Clustering Algorithm Based On Streaming Computing
9	Research And Application Of Parallel Data Mining Based On Spark
10	Parallel Division Clustering Optimization Algorithm Based On Spark