KDSG-DBSCAN:A High Performance DBSCAN Algorithm Based On K-D Tree And Spark GraphX

Posted on:2018-05-13

Degree:Master

Type:Thesis

Country:China

Candidate:X Gao

Full Text:PDF

GTID:2428330515997782

Subject:Cartography and Geographic Information System

Abstract/Summary:

PDF Full Text Request

Density Based Spatial Clustering of Applications with Noise(DBSCAN)is a spatial clustering algorithm that is widely used in many applications.It has various advantages such as no priori assumption needed about the number of clusters,capable of finding arbitrarily shaped clusters and performs well in the presence of noise data.However,the performance of classical DBSCAN is seriously affected by the cost of distance computing between points and the limitation of single computing units when the size of datasets gets large.To address these problems,this paper introduces a performance improved parallel DBSCAN algorithm based on K-D Tree and Spark GraphX,named KDSG-DBSCAN.This algorithm uses a neighbor search method supported by a K-D Tree index structure to reduce computing cost when computing distance between points.In addition,this new algorithm improves the merging process for local clusters using a parallel graph algorithm of connected component supported by Apache Spark GraphX.The main contents of this paper are as follows:1)Analyze the Performance of classical DBSCAN procedures.Review the theoretical basis of clustering algorithm and the parallel computing framework for large scale data clustering.Then analyzes the performance of classical DBSCAN procedures,and explores the performance bottleneck of this commonly used spatial data clustering algorithm.After that conclusion was made that the performance bottleneck of classic DBSCAN algorithm lies in its serial execution model and iterative inter-point distance calculation.2)Improve the local clustering efficiency based on the index structure.In the DBSCAN local clustering stage,a large number of unnecessary distance calculation between points was processed due to the density assessment,causing a performance bottleneck.To improve this,a K-D Tree index structure is introduced to reduce the complicity of this process based on the neighbor query method.By doing this,only the points within the range of a boundary rectangle was included in the distance computing process,reducing the number of calculations,thus improving the performance of DBSCAN local clustering stage.3)Improve the performance of DBSCAN global merging process of local clusters.The global merging process of local clusters in the DBSCAN algorithm tends to be a performance bottleneck due to the large amount of communication cost between computing nodes and 10 cost of data block transmission.To improve this,we represent the correlation of data points in a common cluster as edges in a graph structure.Then we improve the parallelism of the global merging process by applying a parallel connected components algorithm supported by Spark GraphX,thus reducing the communication and transmission cost and improving the performance.4)Algorithm implementation and experiment verification.We implemented the KDSG-DBSCAN based on Java object-oriented programming language and Spark distributed memory computing framework.Then four experiments was carried out to verify the feasibility,performance,and scalability of this proposed algorithm.We compared the execution times and calculated the speedup ratio between KDSG-DBSCAN,classical DBSCAN,and KDS-DBSCAN.We calculated and compared the execution time costs of three sub-procedures of KDSG-DBSCAN.We compared the performance of KDSG-DBSCAN with growing data sizes;and we analyzed scalability of KDSG-DBSCAN in terms of the number of computing nodes and CPU cores.The algorithm presented in this paper performs well both in terms of speedup ratio and scalability,which provides a feasible method for clustering large scale spatial data.

Keywords/Search Tags:

DBSCAN, K-D Tree, MapReduce, Spark GraphX, large scale spatial clustering

PDF Full Text Request

Related items

1	Research On Parallel Clustering Algorithm For Large - Scale Data Set
2	Research On Parallization Of DBSCAN Clustering Algorithm For Spatial Data Mining Based On Spark Platform
3	Research And Implementation Of DBSCAN Algorithm Based On Spatial Clustering
4	Research On Adaptive Parameter Of DBSCAN Algorithm And Its Application On Spark Platform
5	Study And Realization Of Spatial Data Clustering Analysis System Based On SADBS
6	A Research About DBSCAN Text Clustering Based On Spark Platform
7	Research On Key Techniques Of High Performance Spatial Query Processing For Large Scale Spatial Data
8	Fast Analysis Of Large-scale Wafer Inspection Data
9	Design And Implementation Of Spectral Clustering Algorithm For Large Scale Data
10	Research And Application Of Clustering Algorithms For Large Scale Data