
Research On Data Stream Clustering Algorithm Based On Spark Streaming

Posted on: 2017-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: T J Zhi
Full Text: PDF
GTID: 2428330566453032
Subject: Software engineering
Abstract/Summary:
With the rapid development of computer networks and information technology, the data stream has become the dominant data model in network monitoring, financial analysis, and medical research. Extracting valuable information promptly and accurately from real-time, ordered, massive data streams is of great significance. Distributed stream processing systems emerged to improve the efficiency and throughput of data stream processing. Among them, the emerging Spark Streaming system has been applied effectively thanks to its near-real-time processing and high throughput. To meet the demand for cluster mining on massive data streams, this thesis studied and optimized the CluStream algorithm, which is based on a two-phase framework, and the grid-density-based D-Stream algorithm, and designed parallel versions of both for Spark Streaming. The specific contents of this thesis are as follows:

1) To address the weaknesses of CluStream, this thesis proposed FCPCluStream. During online micro-clustering, the micro-cluster structure of CluStream cannot reflect the real-time evolution of the data stream. This thesis introduced a time decay factor to reduce the influence of historical data on the clusters. It also improved the computation of the cluster center and the clustering distance, and optimized the pyramidal time frame used for snapshot storage. During offline macro-clustering, CluStream requires the user to supply the number of clusters for merging micro-clusters, which can result in low-quality clusters. This thesis used the Canopy algorithm to determine the number of clusters and the initial cluster centers, and applied Canopy-KMeans to optimize the offline merging of micro-clusters. FCPCluStream is the result of these improvements.

2) This thesis designed a parallel version of FCPCluStream on the Spark Streaming model to improve its efficiency. An overall parallel structure for FCPCluStream was designed in accordance with the features of the Spark Streaming model. In the online clustering stage, map operations were used for micro-cluster initialization and real-time micro-cluster updates. In the offline macro-clustering stage, map, combine, and reduce operations were used for micro-cluster merging based on Canopy-KMeans.

3) This thesis studied D-Stream in light of the features of Spark Streaming and proposed an improved grid partitioning method to raise its execution efficiency. When D-Stream is parallelized, partitioning the grid space evenly leads to load imbalance, so this thesis proposed a corresponding algorithm to address it. Map operations were used for online grid mapping and offline cluster adjustment, and a global cluster-merging method was then designed.

4) This thesis designed an experimental scheme and tested the performance of the clustering algorithms on the Spark Streaming platform. A Spark + YARN platform was built, and the KDD CUP 1999 network intrusion data set was used. The experiments analyzed the clustering quality of the algorithms and measured their speedup on the cluster. The results verify the feasibility and effectiveness of the data stream clustering algorithms based on Spark Streaming.
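The Canopy step described in item 1) can be illustrated with a minimal, single-machine sketch. It is not the thesis's implementation. It assumes the standard Canopy procedure with a loose threshold `t1` and a tight threshold `t2` (both names and their values are hypothetical here), applied to a list of micro-cluster centers. The number of canopies found would serve as the cluster count, and the canopy centroids as initial centers for a subsequent K-Means pass.

```python
import math


def canopy_centers(points, t1, t2):
    """Return one centroid per canopy found in `points`.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate list); both are assumed
    parameters, with t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    centroids = []
    while remaining:
        # Take the next candidate as the seed of a new canopy.
        seed = remaining.pop(0)
        members = [seed]
        still_remaining = []
        for p in remaining:
            d = math.dist(seed, p)
            if d < t1:
                members.append(p)          # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: may seed or join others
        remaining = still_remaining
        dim = len(seed)
        centroids.append(
            tuple(sum(m[i] for m in members) / len(members) for i in range(dim))
        )
    return centroids


# Hypothetical micro-cluster centers forming two well-separated groups.
micro_centers = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0)]
initial_centers = canopy_centers(micro_centers, t1=3.0, t2=1.0)
k = len(initial_centers)  # cluster count that would be passed to K-Means
```

In this sketch the user no longer supplies the cluster count directly, but still has to choose `t1` and `t2`; how the thesis selects those thresholds is not stated in the abstract.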
Keywords/Search Tags:Data Stream, Spark Streaming, CluStream Algorithm, D-Stream Algorithm, Parallel Processing