Implementation And Application Of Clustering Algorithm Based On Spark

Posted on: 2019-06-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Zhu
GTID: 2428330566499378
Subject: Computer technology
Abstract/Summary:
In recent years, how to efficiently mine potentially valuable information from the ocean of data has been a hot research topic in data mining and related fields, and clustering analysis is one of the most popular topics within data mining. DBSCAN is an important density-based clustering algorithm. It is fast, handles "noise" points effectively, and discovers clusters of arbitrary shape, but its running time is not good enough for big data mining. Apache Spark is today's mainstream big data processing framework. It extends the widely used MapReduce computational model with a memory-based parallel computing framework, and it reduces disk I/O by caching intermediate results in memory, so it can efficiently support interactive queries, iterative computation, and similar workloads. To mine big data more effectively, this thesis studies how to parallelize the DBSCAN algorithm on the Spark platform and designs a Spark-based parallelization scheme for density clustering. Through the rational use of RDDs and the design of the sample operator, map function, collectAsMap operator, and reduceByKey operator, the scheme parallelizes the process of finding the density-reachable data points of each core object. Experiments that cluster the UCI Wine, Car Evaluation, and Adult data sets with the parallel DBSCAN algorithm on the Spark platform show that the algorithm achieves better accuracy and running time and is suitable for big data clustering.

To test the practicality of the research results, this thesis also develops a simple telecom user classification system. The Spark-based parallel DBSCAN algorithm is applied in the system's telecom user classification module: users' basic information and behavioral information are combined, and the parallel DBSCAN algorithm then groups the users. The application results show that the system classifies users accurately and efficiently and provides a basis for developing targeted marketing strategies from the classification results, which reflects the practical value of the research work.
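The idea of finding the neighborhood of every point in parallel and then expanding clusters from core objects can be illustrated with a small sketch. The code below is not the thesis's scheme and does not use Spark; it is a minimal, standard-library stand-in under stated assumptions. The names `neighbor_pairs`, `reduce_by_key`, `dbscan`, and `n_chunks` are hypothetical, the chunks only imitate RDD partitions, and `reduce_by_key` only imitates the grouping effect of Spark's reduceByKey followed by collectAsMap.

```python
from collections import defaultdict, deque
from math import dist


def neighbor_pairs(chunk, points, eps):
    """'map' step: emit (i, j) for every point j within eps of point i."""
    for i in chunk:
        for j, q in enumerate(points):
            if dist(points[i], q) <= eps:
                yield i, j


def reduce_by_key(pairs):
    """Stand-in for reduceByKey + collectAsMap: merge values per key."""
    merged = defaultdict(set)
    for key, value in pairs:
        merged[key].add(value)
    return dict(merged)


def dbscan(points, eps, min_pts, n_chunks=2):
    # Split the point indices into chunks, imitating partitions.
    chunks = [range(k, len(points), n_chunks) for k in range(n_chunks)]
    pairs = (p for chunk in chunks for p in neighbor_pairs(chunk, points, eps))
    neighbors = reduce_by_key(pairs)

    # A core object has at least min_pts points (itself included) within eps.
    core = {i for i, ns in neighbors.items() if len(ns) >= min_pts}

    labels = [-1] * len(points)  # -1 marks noise
    cluster = 0
    for seed in sorted(core):
        if labels[seed] != -1:
            continue  # already absorbed into an earlier cluster
        labels[seed] = cluster
        queue = deque([seed])  # the queue only ever holds core objects
        while queue:
            i = queue.popleft()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if j in core:  # border points are labeled, not expanded
                        queue.append(j)
        cluster += 1
    return labels
```

For example, with `points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]`, the call `dbscan(points, eps=1.5, min_pts=3)` returns `[0, 0, 0, 1, 1, 1, -1]`: two clusters and one noise point. In this sketch only the neighborhood search is split into chunks; the cluster expansion runs sequentially over the merged neighbor map.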
Keywords/Search Tags: Clustering, DBSCAN algorithm, Spark platform, parallelization, user classification