The Parallel Design And Application Of The CURE Algorithm Based On Spark Platform

Posted on:2015-12-12

Degree:Master

Type:Thesis

Country:China

Candidate:R C Qiu

Full Text:PDF

GTID:2298330422482025

Subject:Computer system architecture

Abstract/Summary:

In recent years, research on cloud computing has risen together with the study of large-scale data processing platforms, and the emergence of Hadoop shifted attention from MPI (Message Passing Interface) to the MapReduce computation model. By introducing the RDD (Resilient Distributed Datasets) model, the Spark platform greatly improved processing speed, and it handles interactive and iterative computation much better than Hadoop. This strength in iterative computation makes Spark well suited to become a major tool for big data mining. Data mining is one of the core processing tasks in the big data area and has high processing requirements; Spark emerged to meet these needs of enterprises and scholars. Clustering is an important part of data mining. However, the only clustering algorithm the Spark platform currently supports is K-means, which applies only to spherical data sets, so it is necessary to make it possible to cluster any data set on Spark. The CURE clustering algorithm clusters well and is applicable to any data set, but its computational complexity is relatively high. Parallelizing the CURE algorithm can therefore improve clustering quality on Spark and enrich the clustering algorithms available on big data processing platforms.
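To illustrate why CURE is not tied to spherical clusters, here is a minimal Python sketch of the representative-point step of the original CURE algorithm: a few well-scattered points are picked from a cluster and then shrunk toward its centroid. The function names, the tie-breaking order, and the shrink factor `alpha` are illustrative assumptions; the abstract does not give the thesis's code, and this is not the ACURE variant.

```python
import math


def centroid(points):
    """Arithmetic mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def scattered_points(points, c):
    """Pick up to c well-scattered points with a farthest-point heuristic."""
    candidates = list(dict.fromkeys(points))  # drop duplicates, keep order
    mean = centroid(points)
    # Start with the point farthest from the centroid.
    chosen = [max(candidates, key=lambda p: math.dist(p, mean))]
    while len(chosen) < min(c, len(candidates)):
        # Next, take the point whose nearest already-chosen point is farthest.
        remaining = [p for p in candidates if p not in chosen]
        chosen.append(
            max(remaining, key=lambda p: min(math.dist(p, q) for q in chosen))
        )
    return chosen


def shrink(reps, mean, alpha):
    """Move every representative a fraction alpha of the way to the centroid."""
    return [
        tuple(r[d] + alpha * (mean[d] - r[d]) for d in range(len(mean)))
        for r in reps
    ]
```

Because a cluster is summarized by several points rather than a single center, the distance between two clusters can follow elongated or irregular shapes; shrinking toward the centroid reduces the influence of outliers.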
Currently, smart mobile devices are driving the fast-growing mobile Internet, and companies around the world pay close attention to it, because seizing the mobile market means seizing a key business opportunity. It is therefore necessary to mine mobile Internet users' data in order to provide users with personalized marketing and business recommendations, so that users are retained and bring benefits to enterprises. Because little research has been done, in China or abroad, on implementing clustering algorithms on the Spark platform, and for the reasons given above, this thesis studies the parallelization of the CURE algorithm on Spark. Firstly, the Spark platform is analyzed in detail, and data mining algorithms are analyzed and summarized. Secondly, the CURE algorithm is improved into an algorithm named ACURE, which uses a decentralized representative point selection algorithm so that the chosen representative points are more dispersed than in the original CURE algorithm, further improving the clustering effect. Thirdly, data-parallel and task-parallel implementations of ACURE on the Spark platform are studied and compared, leading to the conclusion that the two modes cannot be applied simultaneously and that data parallelism performs much better. Furthermore, the thesis compares the impact of partitioning on the data-parallel ACURE algorithm and compares the performance of stand-alone processing with parallel processing on Spark. ACURE on Spark is then applied to mining mobile Internet big data; its clustering results for mobile Internet users' online behavior are compared with those of K-means, and the conclusion is that the clustering produced by ACURE is more realistic.
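The data-parallel idea mentioned above can be sketched in plain Python as "split the data into partitions, cluster each partition independently, then merge the partial clusters". This is only an assumption about the general scheme, not the thesis's implementation: it does not use Spark, the names `agglomerate` and `parallel_cluster` are hypothetical, and plain single-link distance with a stopping `threshold` stands in for CURE's representative-point distance.

```python
import math


def cluster_distance(a, b):
    """Single-link distance: the closest pair of points across two clusters."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(clusters, threshold):
    """Repeatedly merge the two closest clusters while they are within threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def parallel_cluster(points, num_partitions, threshold):
    """Partition, cluster locally, then merge (a stand-in for a map/reduce job)."""
    # Round-robin split standing in for the partitions of a distributed dataset.
    partitions = [points[i::num_partitions] for i in range(num_partitions)]
    # "Map" phase: each partition is clustered on its own, starting from singletons.
    partial = [
        c
        for part in partitions
        if part
        for c in agglomerate([[p] for p in part], threshold)
    ]
    # "Reduce" phase: the partial clusters from all partitions are merged.
    return agglomerate(partial, threshold)
```

In this sketch the number of partitions changes how much work is done locally versus in the final merge, which is the kind of trade-off the partition comparison in the abstract refers to.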
Finally, deep data mining of mobile Internet users' data is carried out in terms of time, interest, consumption level and other aspects, to provide rich user information for personalized recommendation.

Keywords/Search Tags:

Spark, CURE, User Behavior Analysis, Clustering algorithm, Parallelization

Related items

1	Implementation And Application Of Clustering Algorithm Based On Spark
2	Research And Application Of K-means Algorithm Based On Spark
3	Research On Optimization Of Association Rule Apriori Algorithm And Its Parallelization Based On Spark
4	Research And Implementation Of Classification Algorithm Parallelization Based On Spark
5	The Parallelization And Optimization Of Fp-Growth Algorithm Based On Spark
6	The Parallelization And Optimization Of K-means Algorithm Based On Spark
7	Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark
8	Muriel Spark's comic manifesto: Wit as weapon, tool, and cure
9	The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform
10	Research On Improvement Of Recommendation Algorithm Based On Spark