Design And Implementation Of Distributed Data Mining Algorithms Based On Spark

Posted on: 2019-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: Q Luo
Full Text: PDF
GTID: 2428330569496086
Subject: Computer application technology
Abstract/Summary:
With the rise of technologies such as cloud computing, social networking and the Internet of Things, data volumes keep growing and accumulating, and traditional data mining approaches can no longer meet users' needs. Fast and accurate big data mining has therefore become an active research topic. Compared with Hadoop, the platform popular in recent years, which handles data mining problems inefficiently, Spark, a big data platform based on in-memory computing, is better suited to implementing data mining algorithms. This dissertation studies the current status of distributed data mining and the related technologies based on Spark, and proposes two different types of distributed data mining algorithms: (1) the distributed clustering optimization algorithm CK-means, which does not need to consider data partitioning; (2) the distributed outlier detection algorithm VDOD, which needs to partition the data. The main work is as follows:

(1) The dissertation analyzes and summarizes the research status of the K-means clustering algorithm and of distance-based outlier detection algorithms. The technologies related to Spark-based distributed data mining are studied from three aspects: the big data processing framework Spark, clustering algorithms and outlier detection algorithms.

(2) First, the traditional K-means clustering algorithm and its parallel implementation on Spark are studied. Second, to address the instability of the clustering results, the optimized algorithm CK-means is proposed, in which the Canopy algorithm is used to select the initial cluster centers. CK-means selects the Canopy center points based on probability, which improves the stability of the clustering. It also increases the expected number of Canopy center points selected in a single step and selects them in parallel, which improves computing efficiency. The CK-means algorithm is implemented on the Spark platform. Finally, experiments show that CK-means improves on K-means in both computational efficiency and clustering accuracy on large-scale data sets.

(3) Building on the distance-based DB(k,r) outlier detection method, an improved distributed outlier detection algorithm named VDOD is proposed. In the data pre-processing stage, a variance-based data partitioning method is designed. This method not only balances the workload of the compute nodes but also reduces the damage to data proximity, thus reducing the amount of network communication during outlier detection. In the outlier detection stage, VDOD uses an R-tree index to compute outliers quickly, and then obtains the final global outliers with a small amount of network communication. The VDOD algorithm is also implemented on the Spark platform. Finally, extensive experiments demonstrate the validity of VDOD: the results show that it improves computational efficiency and reduces network overhead compared with existing algorithms.
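The idea in item (2) of seeding K-means with Canopy centers can be illustrated with a minimal single-machine sketch. This is not the thesis's CK-means: it uses the classic deterministic Canopy pass rather than the probability-based, parallel selection described above, and it does not use Spark. The function names, the threshold name t2 and the sample points are all hypothetical, chosen only for illustration.

```python
import math


def canopy_centers(points, t2):
    """Classic Canopy pass (assumed variant, not the thesis's probabilistic one).

    Repeatedly takes the first remaining point as a center and discards every
    remaining point within the tight threshold t2 of it. The loose threshold
    T1 of the classic algorithm only affects canopy membership, not which
    centers are chosen, so it is omitted here.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return centers


def kmeans(points, centers, iterations=10):
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group;
        # a center with an empty group stays where it is.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0), (0.0, 0.4)]
seeds = canopy_centers(points, t2=2.0)   # also fixes the number of clusters
final_centers = kmeans(points, seeds)
```

In this sketch the Canopy pass determines both the number of clusters and their starting positions, so K-means no longer depends on a random initialization. That dependence is the source of instability that item (2) sets out to remove.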
Keywords/Search Tags:Spark, Distributed Data Mining, K-means, Outlier Detection, Data Partition