
Research On Dynamic Clustering Algorithm Based On Spark Framework

Posted on: 2018-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: B T Zhang
GTID: 2518305963492634
Subject: Electronics and Communications Engineering
Abstract/Summary:
In the information age, real-time streaming data appears ever more frequently in data collection scenarios, and traditional data analysis methods cannot meet the demands of real-time processing. In recent years, many useful algorithms have been proposed for data stream mining, and clustering, an important branch of data mining, has also made breakthroughs in data stream applications. Traditional clustering methods process existing data in batches and focus mainly on clustering accuracy. Because data streams are massive and arrive in real time, a data stream clustering algorithm must consider more factors: it must scan the stream quickly, deliver clustering results in real time, and identify outliers and noise points promptly and accurately. D-Stream, a data stream clustering algorithm based on grid density, adopts the classical two-layer online/offline framework, which guarantees both efficiency and accuracy and can form clusters of arbitrary shape.

With advances in information collection technology, the scale of data to be processed keeps growing, so parallel clustering algorithms for data streams are needed. How to parallelize a traditional data stream clustering algorithm so that it improves running efficiency without losing precision is an important problem. At the same time, distributed stream processing frameworks such as Storm, Spark, and Samza have emerged and quickly been deployed at large scale; their open source communities are very active, and they make distributed data stream processing much easier to use.

This thesis parallelizes the serial data stream clustering algorithm D-Stream to obtain PDStream: it introduces the concept of a block on top of the original algorithm and runs on the big data processing framework Spark. To further improve the time efficiency of the clustering algorithm, the thesis introduces an optimization based on concatenation, which preserves clustering accuracy while speeding up the generation of clustering results. Experimental results show that PDStream can be applied in a distributed environment, achieves higher efficiency and good scalability, and can perform dynamic clustering of streaming data under a distributed architecture.
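The online/offline grid-density idea behind D-Stream can be illustrated with a minimal single-process Python sketch. It is not the thesis's PDStream and uses neither Spark nor blocks. The class name GridDensityClusterer, the parameters cell_size, decay and dense_threshold, and the simplified rules (one density threshold, clusters formed from axis-adjacent dense cells) are all assumptions made for illustration.

```python
from collections import deque


class GridDensityClusterer:
    """Toy grid-density stream clusterer (D-Stream-style sketch, hypothetical API)."""

    def __init__(self, cell_size=1.0, decay=0.998, dense_threshold=3.0):
        self.cell_size = cell_size              # side length of a grid cell (assumed)
        self.decay = decay                      # per-time-unit decay factor in (0, 1]
        self.dense_threshold = dense_threshold  # minimum density of a "dense" cell
        self.cells = {}                         # cell -> (density, last update time)

    def cell_of(self, point):
        # Map a point to the integer coordinates of the cell containing it.
        return tuple(int(x // self.cell_size) for x in point)

    def insert(self, point, t):
        # Online phase: decay the cell's stored density up to time t, then add 1.
        cell = self.cell_of(point)
        density, last = self.cells.get(cell, (0.0, t))
        self.cells[cell] = (density * self.decay ** (t - last) + 1.0, t)

    def density(self, cell, t):
        # Density of a cell as seen at time t (0.0 for a cell never touched).
        density, last = self.cells.get(cell, (0.0, t))
        return density * self.decay ** (t - last)

    def clusters(self, t):
        # Offline phase: group dense cells that are adjacent along one axis.
        dense = {c for c in self.cells
                 if self.density(c, t) >= self.dense_threshold}
        seen, result = set(), []
        for start in sorted(dense):
            if start in seen:
                continue
            seen.add(start)
            queue, group = deque([start]), []
            while queue:  # breadth-first search over neighbouring dense cells
                cell = queue.popleft()
                group.append(cell)
                for axis in range(len(cell)):
                    for step in (-1, 1):
                        nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                        if nb in dense and nb not in seen:
                            seen.add(nb)
                            queue.append(nb)
            result.append(sorted(group))
        return result
```

Points are fed one at a time through insert(point, t), and clusters(t) can be called at any moment to obtain the current groups of dense cells. A cell that receives only an occasional point never reaches the threshold and is left out as noise. A parallel version would additionally have to decide how cells are distributed across workers, which this sketch does not attempt.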
Keywords/Search Tags: D-Stream, PDStream, Spark, Dynamic clustering