Implementation And Application Of Clustering Algorithm Based On Spark

Posted on:2019-06-20

Degree:Master

Type:Thesis

Country:China

Candidate:Z L Zhu

Full Text:PDF

GTID:2428330566499378

Subject:Computer technology

Abstract/Summary:


In recent years, how to efficiently mine potentially valuable information from massive volumes of data has been a hot research topic in data mining and related fields, and clustering analysis is one of the most actively studied topics in data mining. DBSCAN is an important density-based clustering algorithm. It is fast, handles "noise" points effectively, and can discover clusters of arbitrary shape, but its running time is not good enough when mining big data. Apache Spark is a mainstream big data processing framework that extends the widely used MapReduce computational model with memory-based parallel computing. By caching intermediate results in memory it reduces disk I/O, so it can efficiently support interactive queries, iterative computation, and other workloads. To mine big data more effectively, this thesis studies how to parallelize the DBSCAN algorithm on the Spark platform and designs a Spark-based parallelization scheme for density clustering. Through appropriate use of RDDs and of the sample operator, the map function, the collectAsMap operator, and the reduceByKey operator, the scheme parallelizes the process of finding the density-reachable data points of each core object. Results of clustering the UCI Wine, Car Evaluation, and Adult data sets with the parallel DBSCAN algorithm on Spark show that it achieves better accuracy and timeliness and is suitable for big data clustering. To test the practicality of the research results, a simple telecom user classification system is developed in this thesis, and the Spark-based parallel DBSCAN algorithm is applied in its user classification module. Users' basic information and behavioral data are combined, and the parallel DBSCAN algorithm is then used to divide users into groups. The application results show that the system classifies users accurately and efficiently and provides a basis for developing targeted marketing strategies from the classification results, which reflects the practical value of the research work.
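For readers unfamiliar with the terms used above (core object, density-reachable point, noise), the following is a minimal sequential sketch of standard DBSCAN in plain Python. It is not the thesis's Spark implementation: the function names `region_query` and `dbscan`, the parameter names `eps` and `min_pts`, and the brute-force neighbor search are assumptions chosen for illustration. The neighborhood search and cluster expansion shown here correspond to the step that the abstract says is parallelized with RDD operators.

```python
from collections import deque
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks points in no cluster."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            # Not a core object; may be relabelled later as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core object: grow a new cluster from it by
        # collecting every point that is density-reachable from it.
        labels[i] = cluster
        queue = deque(neighbours)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core object
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                # points[j] is itself a core object, so keep expanding.
                queue.extend(j_neighbours)
        cluster += 1
    return labels
```

Calling `dbscan(points, eps, min_pts)` on a list of coordinate tuples returns a list of labels of the same length. The brute-force `region_query` compares every pair of points, which is the cost that motivates distributing the neighbor search when the data set is large.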

Keywords/Search Tags:

Clustering, DBSCAN algorithm, Spark platform, parallelization, user classification


Related items

1	Research On Adaptive Parameter Of DBSCAN Algorithm And Its Application On Spark Platform
2	The Optimization Of Clustering And Classification Algorithms Based On SPARK
3	Research On Parallelization Of DBSCAN Clustering Algorithm For Spatial Data Mining Based On Spark Platform
4	Research On Improved DBSCAN Algorithm Based On Spark Platform
5	A Research About DBSCAN Text Clustering Based On Spark Platform
6	KDSG-DBSCAN:A High Performance DBSCAN Algorithm Based On K-D Tree And Spark GraphX
7	Research And Implementation Of Classification Algorithm Parallelization Based On Spark
8	Research On Parallelization Of Classification Algorithm Based On Spark Platform In Telecom Customer Churn Prediction System
9	Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark
10	The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform