
Analysis Of The Clustering Algorithm On Data Stream Using Resilient Distributed Datasets

Posted on: 2017-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2308330485970923
Subject: Computer software and theory
Abstract/Summary:
With the spread of Internet applications and the rapid growth of the data they generate, most data now arrives as a dynamic data stream that must be processed and analyzed in a timely manner. Researchers at home and abroad have studied clustering algorithms for data streams extensively. Several such algorithms are available, but many problems remain: for example, some cannot reflect the evolution of the data stream, some cannot find clusters of arbitrary shape, and some are inefficient.

In recent years, as new parallel computing platforms have appeared and matured, implementing clustering algorithms on them has attracted wide attention. These platforms offer an effective new way to improve clustering efficiency; one example is Streaming K-Means, a data stream clustering algorithm on Spark. However, because the Spark platform has a short development history, data stream clustering algorithms based on Spark are still rare, and we found only one such case.

In this thesis, we first improve the classical density-based DBSCAN algorithm using the idea of the grid method and propose an algorithm, GDBSCAN, which reduces the time complexity while preserving the ability to find clusters of arbitrary shape. Second, we define the effective time of a data point to reflect the evolution of the data stream. Drawing on the advantages of Resilient Distributed Datasets (RDDs), we then provide a parallel implementation of GDBSCAN on Spark, RDDGD-Stream, which clusters the data stream efficiently in real time. To further improve efficiency, RDDGD-Stream also uses a repartitioning method based on the number of data points in each grid to balance the computing load across the nodes of the cluster.

To validate the effectiveness of GDBSCAN and RDDGD-Stream, we design several sets of experiments that examine clustering efficiency (running time and speedup), evolution, and clustering quality. The experimental results show that the efficiency of GDBSCAN and RDDGD-Stream is significantly improved and that the clustering quality is improved to a certain extent.
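
The abstract does not give the details of GDBSCAN, so the following is only a minimal sketch of the general idea it describes: points are mapped to grid cells, points older than an effective time are dropped, and adjacent dense cells are joined into clusters of arbitrary shape. It is plain single-machine Python rather than Spark, and the function name `grid_cluster` and the parameters `cell_size`, `min_pts`, and `effective_time` are hypothetical names chosen for illustration, not taken from the thesis.

```python
from collections import defaultdict, deque
from itertools import product


def grid_cluster(points, now, cell_size, min_pts, effective_time):
    """Cluster 2-D stream points by grid density (illustrative sketch only).

    points: iterable of (x, y, timestamp) tuples.
    Returns a dict mapping each dense cell (i, j) to a cluster id.
    """
    # 1. Keep only points still within their effective time (assumed rule:
    #    a point expires once it is older than `effective_time`).
    live = [(x, y) for x, y, t in points if now - t <= effective_time]

    # 2. Map each live point to a grid cell and count the points per cell.
    counts = defaultdict(int)
    for x, y in live:
        counts[(int(x // cell_size), int(y // cell_size))] += 1

    # 3. A cell is dense if it holds at least `min_pts` live points.
    dense = {cell for cell, n in counts.items() if n >= min_pts}

    # 4. Join dense cells that touch (8-neighbourhood) with a breadth-first
    #    search, so that a cluster can take an arbitrary shape.
    labels = {}
    next_id = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_id
        queue = deque([start])
        while queue:
            i, j = queue.popleft()
            for di, dj in product((-1, 0, 1), repeat=2):
                neighbour = (i + di, j + dj)
                if neighbour in dense and neighbour not in labels:
                    labels[neighbour] = next_id
                    queue.append(neighbour)
        next_id += 1
    return labels
```

Working on cells instead of individual points is what makes a grid approach cheaper than pairwise distance checks: the neighbourhood search touches at most eight cells per dense cell. In a Spark setting the per-cell counts could plausibly be computed per partition and merged, but how RDDGD-Stream actually does this is not stated in the abstract.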
Keywords/Search Tags: Data Mining, Data Stream, Clustering, DBSCAN, Spark