Analysis Of The Clustering Algorithm On Data Stream Using Resilient Distributed Datasets

Posted on:2017-01-08

Degree:Master

Type:Thesis

Country:China

Candidate:Y Zhang

Full Text:PDF

GTID:2308330485970923

Subject:Computer software and theory

Abstract/Summary:

As Internet applications spread and the volume of generated data grows rapidly, most of that data arrives as dynamic data streams that must be processed and analyzed in a timely manner. Researchers in China and abroad have studied data stream clustering algorithms extensively. Several such algorithms are available, but they still have shortcomings: some cannot reflect the evolution of the data stream, some cannot find clusters of arbitrary shape, and some are inefficient.

In recent years, as new parallel computing platforms have emerged and matured, implementing clustering algorithms on them has attracted wide attention. These platforms offer a new and effective way to improve clustering efficiency; one example is K-Means Streaming, a data stream clustering algorithm on Spark. However, because the Spark platform has a short development history, few Spark-based data stream clustering algorithms exist, and we found only one.

In this thesis, we first improve the classical density-based DBSCAN algorithm using the grid method and propose GDBSCAN, an algorithm that reduces time complexity while preserving the ability to find clusters of arbitrary shape. Second, we define the effective time of a data point to reflect the evolution of the data stream. Third, drawing on the advantages of RDDs, we provide a parallel implementation of GDBSCAN on Spark, RDDGD-Stream, which clusters data streams efficiently in real time. To further improve efficiency, RDDGD-Stream also includes a repartitioning method based on the number of data points in each grid, which balances the computing load across the nodes of the cluster.

To validate the effectiveness of GDBSCAN and RDDGD-Stream, we design multiple sets of experiments that examine clustering efficiency (running time and speedup), evolution, and clustering quality. The experimental results show that the efficiency of both algorithms improves significantly and that clustering quality improves to some extent.
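The abstract does not give the details of GDBSCAN or RDDGD-Stream, so the following is only a minimal single-process sketch of the general ideas it names: grid-based density clustering, an effective time after which points expire, and load balancing by point count per grid. It is not the thesis's algorithm. The function names (`grid_cluster`, `balance_cells`), the parameters (`cell_size`, `min_pts`, `effective_time`), the 2-D points, the rule that adjacent dense cells form one cluster, and the greedy assignment are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict, deque
from itertools import product


def grid_cluster(points, cell_size, min_pts, now, effective_time):
    """Cluster 2-D points at grid-cell granularity (hypothetical sketch).

    points: iterable of (x, y, timestamp).
    Returns a dict mapping each dense cell (i, j) to a cluster id.
    """
    # Assumption: a point is "effective" while its age is within effective_time.
    live = [(x, y) for x, y, t in points if now - t <= effective_time]

    # Map every live point to the grid cell that contains it.
    cells = defaultdict(list)
    for x, y in live:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))

    # Assumption: a cell is dense when it holds at least min_pts points.
    dense = {cell for cell, pts in cells.items() if len(pts) >= min_pts}

    # Join dense cells that touch (8-neighbourhood) with a breadth-first search,
    # so clusters can follow arbitrary shapes at cell resolution.
    labels = {}
    cluster_id = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx, dy in product((-1, 0, 1), repeat=2):
                neighbour = (cx + dx, cy + dy)
                if neighbour in dense and neighbour not in labels:
                    labels[neighbour] = cluster_id
                    queue.append(neighbour)
        cluster_id += 1
    return labels


def balance_cells(cell_counts, num_partitions):
    """Greedily assign cells to partitions, heaviest cell first (hypothetical).

    cell_counts: dict mapping cell -> number of points in that cell.
    Returns a dict mapping cell -> partition index.
    """
    heap = [(0, p) for p in range(num_partitions)]  # (current load, partition)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(cell_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for cell, count in ordered:
        load, partition = heapq.heappop(heap)  # least-loaded partition so far
        assignment[cell] = partition
        heapq.heappush(heap, (load + count, partition))
    return assignment
```

In a Spark setting, the cell key computed in `grid_cluster` could serve as the key of a pair RDD, and an assignment like the one returned by `balance_cells` could drive a custom partitioner. That mapping is likewise an assumption, not something the abstract specifies.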

Keywords/Search Tags:

Data Mining, Data stream, Clustering, DBSCAN, Spark

Related items

1	Research On Parallelization Of DBSCAN Clustering Algorithm For Spatial Data Mining Based On Spark Platform
2	Design and Implementation Of Data Stream Clustering Algorithm StreamCKS Based On Spark Streaming
3	Research On Data Stream Clustering Method Based On Spark
4	A High Dimensional Data Stream Clustering Algorithm Of Quick Dimension Reduction
5	Research On Network Traffic Identification Based On Data Stream Mining
6	Study On Key Technologies Of Frequent Items Mining And Clustering On Data Streams
7	A Density-Based Clustering Algorithm Over Stream Data
8	Research On Dynamic Measurement Based Data Stream Clustering And Its Applications
9	Research On Adaptive Parameter Of DBSCAN Algorithm And Its Application On Spark Platform
10	Adaptive Evolving Data Stream Algorithm Based On Time Decay Window