Font Size: a A A

KDSG-DBSCAN:A High Performance DBSCAN Algorithm Based On K-D Tree And Spark GraphX

Posted on:2018-05-13Degree:MasterType:Thesis
Country:ChinaCandidate:X GaoFull Text:PDF
GTID:2428330515997782Subject:Cartography and Geographic Information System
Abstract/Summary:PDF Full Text Request
Density Based Spatial Clustering of Applications with Noise(DBSCAN)is a spatial clustering algorithm that is widely used in many applications.It has various advantages such as no priori assumption needed about the number of clusters,capable of finding arbitrarily shaped clusters and performs well in the presence of noise data.However,the performance of classical DBSCAN is seriously affected by the cost of distance computing between points and the limitation of single computing units when the size of datasets gets large.To address these problems,this paper introduces a performance improved parallel DBSCAN algorithm based on K-D Tree and Spark GraphX,named KDSG-DBSCAN.This algorithm uses a neighbor search method supported by a K-D Tree index structure to reduce computing cost when computing distance between points.In addition,this new algorithm improves the merging process for local clusters using a parallel graph algorithm of connected component supported by Apache Spark GraphX.The main contents of this paper are as follows:1)Analyze the Performance of classical DBSCAN procedures.Review the theoretical basis of clustering algorithm and the parallel computing framework for large scale data clustering.Then analyzes the performance of classical DBSCAN procedures,and explores the performance bottleneck of this commonly used spatial data clustering algorithm.After that conclusion was made that the performance bottleneck of classic DBSCAN algorithm lies in its serial execution model and iterative inter-point distance calculation.2)Improve the local clustering efficiency based on the index structure.In the DBSCAN local clustering stage,a large number of unnecessary distance calculation between points was processed due to the density assessment,causing a performance bottleneck.To improve this,a K-D Tree index structure is introduced to reduce the complicity of this process based on the neighbor query method.By doing this,only the points within the range of a boundary rectangle was included in the distance computing process,reducing the number of calculations,thus improving the performance of DBSCAN local clustering stage.3)Improve the performance of DBSCAN global merging process of local clusters.The global merging process of local clusters in the DBSCAN algorithm tends to be a performance bottleneck due to the large amount of communication cost between computing nodes and 10 cost of data block transmission.To improve this,we represent the correlation of data points in a common cluster as edges in a graph structure.Then we improve the parallelism of the global merging process by applying a parallel connected components algorithm supported by Spark GraphX,thus reducing the communication and transmission cost and improving the performance.4)Algorithm implementation and experiment verification.We implemented the KDSG-DBSCAN based on Java object-oriented programming language and Spark distributed memory computing framework.Then four experiments was carried out to verify the feasibility,performance,and scalability of this proposed algorithm.We compared the execution times and calculated the speedup ratio between KDSG-DBSCAN,classical DBSCAN,and KDS-DBSCAN.We calculated and compared the execution time costs of three sub-procedures of KDSG-DBSCAN.We compared the performance of KDSG-DBSCAN with growing data sizes;and we analyzed scalability of KDSG-DBSCAN in terms of the number of computing nodes and CPU cores.The algorithm presented in this paper performs well both in terms of speedup ratio and scalability,which provides a feasible method for clustering large scale spatial data.
Keywords/Search Tags:DBSCAN, K-D Tree, MapReduce, Spark GraphX, large scale spatial clustering
PDF Full Text Request
Related items