
Study Of Distributed Real-time Data Flow Density Clustering Algorithm Based On Storm

Posted on: 2019-10-13  Degree: Master  Type: Thesis
Country: China  Candidate: L Y Niu
GTID: 2428330548983461  Subject: Computer application technology
Abstract/Summary:
With the advent of the "big data" era and the spread of the Internet of Things, a growing share of traditionally static data is being replaced by dynamic data streams, and real-time data mining has become a hot research topic. Because data streams are temporal, continuous, and constantly changing, traditional clustering algorithms cannot cluster them effectively. Researchers have proposed a number of data stream clustering algorithms, but given the complexity of the problem and the diversity of application scenarios, existing algorithms still leave room for improvement. Their main deficiencies are that the accuracy of the clustering results is not high enough, clustering in a distributed environment is difficult, and performance on high-dimensional data is poor.

This thesis studies data stream clustering using the classical stream clustering framework and density-based clustering. The main work is as follows. First, based on the classical stream clustering framework CluStream and the density clustering algorithm DBSCAN, a data stream density clustering algorithm, DBS-Stream, is proposed. Second, a distributed parallel design of DBS-Stream is proposed and implemented on the real-time stream computing platform Storm.

To improve accuracy, the local sites of the algorithm follow the classical two-stage CluStream framework, and the online phase is divided into two parts: local micro-cluster clustering and global micro-cluster clustering. In the online phase, each local site uses DBSCAN instead of K-means for micro-clustering. The local micro-clusters handle clusters of arbitrary shape and allow the local sites to update data quickly. The central site then applies DBSCAN for global clustering, which effectively improves the quality and accuracy of clustering. For distributed parallel stream clustering, DBS-Stream is given a distributed parallel design and deployed on the Storm stream computing platform, which significantly improves the real-time performance of the stream clustering algorithm.

Comparative experiments against CluStream are designed and analyzed in terms of clustering quality, communication cost, thread pressure, processing time, and other measures. The results show that the proposed algorithm has a clear advantage in communication cost and improves clustering quality and efficiency. In addition, DBS-Stream can handle clusters of arbitrary shape and has no bias toward any particular cluster shape, which gives it some theoretical and practical value.
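The two-level idea described in the abstract (DBSCAN at each local site to form micro-clusters, then DBSCAN at a central site over those micro-clusters) can be illustrated with a minimal Python sketch. This is not the thesis's DBS-Stream implementation and does not use Storm: the function names, the `(centroid, count)` summary, and the `eps` / `min_pts` values are all assumptions made for illustration, and the temporal bookkeeping of CluStream-style micro-clusters is omitted.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dbscan(points, eps, min_pts):
    """Plain batch DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = [k for k in range(len(points))
                  if dist(points[j], points[k]) <= eps]
            if len(nj) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nj)
    return labels


def local_micro_clusters(points, eps, min_pts):
    """Hypothetical local-site step: summarize each DBSCAN cluster
    as a (centroid, point_count) pair; noise points are dropped."""
    groups = {}
    for p, lab in zip(points, dbscan(points, eps, min_pts)):
        if lab != -1:
            groups.setdefault(lab, []).append(p)
    return [(tuple(sum(c) / len(g) for c in zip(*g)), len(g))
            for g in groups.values()]


def global_clusters(micro_clusters, eps, min_pts):
    """Hypothetical central-site step: DBSCAN over micro-cluster centroids.
    The counts are carried along but not used for density in this sketch."""
    centroids = [c for c, _ in micro_clusters]
    return dbscan(centroids, eps, min_pts)


# Two local sites, each seeing part of the stream (made-up sample points).
site_a = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
site_b = [(2, 0), (2, 1), (3, 0), (3, 1), (50, 50)]

micro = (local_micro_clusters(site_a, eps=1.5, min_pts=3)
         + local_micro_clusters(site_b, eps=1.5, min_pts=3))
labels = global_clusters(micro, eps=3.0, min_pts=1)
print(micro)
print(labels)  # [0, 1, 0]
```

In this toy run, only the micro-cluster summaries travel from the local sites to the central site, not the raw points, and the micro-clusters near the origin from the two sites end up in the same global cluster. A real deployment would also have to update micro-clusters incrementally as new points arrive.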
Keywords/Search Tags:CluStream, data stream, DBSCAN, density, Storm