
The Parallel Design And Application Of The CURE Algorithm Based On Spark Platform

Posted on: 2015-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: R C Qiu
GTID: 2298330422482025
Subject: Computer system architecture
Abstract/Summary:
In recent years, research on cloud computing has risen together with the study of large-scale data processing platforms. The birth of Hadoop shifted attention from MPI (Message Passing Interface) to the MapReduce computation model. By introducing the RDD (Resilient Distributed Datasets) model, the Spark platform greatly improved processing speed, and it handles interactive and iterative computation much better than Hadoop. This strength in iterative computation makes Spark well suited to become a major tool for big data mining. Data mining is one of the core processing tasks in the big data area and has high processing requirements; Spark emerged to meet these needs of enterprises and scholars. Clustering is an important part of data mining. However, the only clustering algorithm the Spark platform currently supports is K-means, which is suited only to spherical data sets, so an algorithm that can cluster arbitrary data sets on Spark is needed. The CURE clustering algorithm gives good clustering results and is applicable to any data set, but its computational complexity is relatively high. Parallelizing CURE can therefore improve clustering on Spark and enrich the set of clustering algorithms available on big data processing platforms.
Smart mobile devices are currently driving a booming mobile Internet that companies worldwide watch closely, because seizing the mobile market means seizing a key business opportunity. Mining mobile Internet users' data is therefore necessary in order to provide users with personalized marketing and business recommendations, which retains users and brings benefits to enterprises. Because little research has been done, in China or abroad, on implementing clustering algorithms on the Spark platform, and for the reasons given above, this thesis studies the parallelization of the CURE algorithm on Spark. First, the Spark platform is analyzed in detail, and data mining algorithms are analyzed and summarized. Second, the CURE algorithm is improved; the improved version, named ACURE, uses a decentralized representative point selection algorithm so that the selected representative points are more dispersed than in the original CURE algorithm, which further improves the clustering effect. Third, data parallelism and task parallelism of ACURE on the Spark platform are implemented and compared, leading to the conclusion that the two modes cannot be applied simultaneously and that data parallelism performs much better. Furthermore, the thesis examines the impact of partitioning on data-parallel ACURE and compares the performance of stand-alone processing with that of parallel processing on Spark. ACURE on Spark is then applied to mine mobile Internet big data: its clustering of mobile Internet users' online behavior is compared with the K-means clustering results, and the comparison shows that the ACURE clustering results are more realistic.
Finally, in-depth data mining of mobile Internet users' data is carried out with respect to time, interests, consumption level, and other aspects, providing rich user information for personalized recommendation.
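
The abstract does not spell out ACURE's decentralized selection rule. As a point of reference only, the sketch below shows the representative point step of the original CURE algorithm in plain Python: well-scattered points are picked by farthest-point selection and then shrunk toward the cluster centroid, and the distance between two clusters is the smallest distance between their representatives. The function names and the `num_reps` and `shrink` parameters are illustrative assumptions, not taken from the thesis, and no Spark code is involved.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))


def select_representatives(points, num_reps, shrink):
    """Standard CURE step (not ACURE): scatter, then shrink.

    points   -- list of equal-length coordinate tuples for one cluster
    num_reps -- assumed parameter: how many representatives to keep
    shrink   -- assumed parameter in [0, 1]: fraction of the way each
                representative is moved toward the centroid
    """
    center = centroid(points)
    # First representative: the point farthest from the centroid.
    reps = [max(points, key=lambda p: math.dist(p, center))]
    # Each further representative is the point whose nearest already
    # chosen representative is as far away as possible.
    while len(reps) < min(num_reps, len(points)):
        reps.append(
            max(points, key=lambda p: min(math.dist(p, r) for r in reps))
        )
    # Shrinking toward the centroid dampens the influence of outliers.
    return [
        tuple(r[d] + shrink * (center[d] - r[d]) for d in range(len(center)))
        for r in reps
    ]


def cluster_distance(reps_a, reps_b):
    """Distance between two clusters: closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)
```

In a data-parallel setting such as the one the abstract describes, a step like this would typically run independently on each partition's clusters before partial results are merged; how the thesis actually organizes that merge is not stated here.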
Keywords/Search Tags: Spark, CURE, User Behavior Analysis, Clustering algorithm, Parallelization