
Research on Parallelization of the DBSCAN Clustering Algorithm for Spatial Data Mining Based on the Spark Platform

Posted on: 2018-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: D Jin
Full Text: PDF
GTID: 2348330512983274
Subject: Surveying Science and Technology
Abstract/Summary:
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), proposed by Martin Ester et al. in 1996, is a density-based clustering algorithm. It can discover clusters of arbitrary shape, effectively distinguish noise points, and naturally support spatial databases, and it has been widely used in the field of spatial data mining (SDM). However, the processing time of DBSCAN grows superlinearly with the amount of input data (up to quadratically when no spatial index is used), so the serial algorithm cannot meet the real-time demands of some large-scale spatial data mining applications.

To address this challenge, various parallel algorithms have been designed and developed on high-performance computing (HPC) platforms such as Linux clusters, graphics processing units (GPUs), and Hadoop. However, studies adopting these methods have one or more of the following problems: (1) Traditional HPC platforms are usually expensive, poorly scalable, and lacking in fault tolerance, and some have built-in bottlenecks in data transmission. (2) When a multi-iteration clustering algorithm such as DBSCAN is parallelized on Hadoop, data is frequently read from and written to the distributed file system, so processing efficiency drops significantly as the amount of input data increases.

As one of the next-generation, general-purpose fast engines for large-scale data processing, Spark provides the Resilient Distributed Dataset (RDD) abstraction for data storage. With RDDs, intermediate data sets do not need to be written to the distributed file system, which helps meet real-time processing requirements while ensuring high scalability and fault tolerance. As a result, Spark can overcome the weaknesses of the traditional parallel platforms mentioned above. Therefore, based on the principle and implementation of the DBSCAN algorithm in the spatial data mining area, this thesis primarily studies its parallel strategies and optimization methods on Spark.

The main research objectives are as follows:

(1) Systematic analysis and parallelism design for the DBSCAN clustering algorithm used in the spatial data mining field. Starting from an implementation of the serial DBSCAN clustering algorithm, the hotspots are detected with a performance analysis tool (Intel VTune), and an appropriate parallelization approach is designed by considering the characteristics of both the Spark platform and the DBSCAN algorithm itself.

(2) Implementation and optimization of the parallel DBSCAN clustering algorithm on a single-node Spark platform. The parallel algorithm is implemented by taking advantage of the Spark parallelization platform and workflow. To improve its efficiency, optimization measures are proposed in three areas: data transmission, data serialization, and resource parameters. The performance of this version of the parallel DBSCAN algorithm is also compared against that of an existing OpenMP-based version on the same computing platform.

(3) To make full use of the computational resources of the Spark nodes, the implementation modes for the parallel DBSCAN clustering algorithm on a Spark cluster are further discussed. With the help of Docker containerization technology, the Yet Another Resource Negotiator (YARN) resource manager, and the Mesos resource manager, different versions of the parallel DBSCAN algorithm are implemented so that the hardware resources are fully used. In addition, the performance of the parallel DBSCAN algorithm on the Spark cluster is compared with that of a comparable algorithm on a traditional Hadoop cluster.

(4) To validate the effectiveness and efficiency of the proposed Spark-based parallel DBSCAN algorithm, the algorithm is applied to congestion detection in a real-world urban area.

Finally, from the experiments and the analysis of the results, we conclude:

(1) The parallel DBSCAN algorithm on the single-node Spark platform performs better than the OpenMP-based version.

(2) On the Spark cluster, the parallel DBSCAN algorithm under the Spark on YARN deployment mode is more suitable for iterative clustering algorithms than under Spark on Mesos. Compared with the parallel DBSCAN algorithm on Hadoop, both versions show a substantial improvement in efficiency.

(3) The results of applying the implemented parallel algorithm to urban congestion detection further verify the practicability and efficiency of the parallel DBSCAN algorithm based on the Spark platform.
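
For readers unfamiliar with the serial algorithm that the abstract takes as its starting point, the following is a minimal Python sketch of textbook DBSCAN with a brute-force neighborhood search. It is not the thesis implementation; the names `dbscan`, `region_query`, `eps`, and `min_pts` are assumptions chosen for illustration, and the sample points are made up.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (brute force, O(n))."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical sample data: two tight groups and one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))
```

Each call to `region_query` scans every point, so the sketch costs on the order of n² distance computations overall. This repeated neighborhood search is the kind of work that a spatial index or a parallel platform is typically used to reduce.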
Keywords/Search Tags:DBSCAN algorithm, Spatial data mining, Spark platform, Parallel computing, Urban area congestion detection