Research On Data Stream Clustering Algorithm Based On Spark Streaming

Posted on:2017-03-02

Degree:Master

Type:Thesis

Country:China

Candidate:T J Zhi

Full Text:PDF

GTID:2428330566453032

Subject:Software engineering

Abstract/Summary:

With the rapid development of computer networks and information technology, the data stream has become a major data model in network monitoring, financial analysis, and medical research. Extracting valuable information promptly and accurately from real-time, ordered, massive data streams is of great significance. Distributed stream processing systems emerged to improve the efficiency and throughput of data stream processing. Among them, the emerging Spark Streaming system has been applied effectively thanks to its near-real-time processing and high throughput. To meet the demand for cluster mining on massive data streams, this thesis studies and optimizes the CluStream algorithm, which is based on a two-phase framework, and the grid-density-based D-Stream algorithm, and designs parallel versions of both on Spark Streaming. The specific contents of this thesis are as follows:

1) To address the weaknesses of CluStream, this thesis proposes FCPCluStream. During online micro-clustering, CluStream's micro-cluster structure cannot reflect the real-time evolution of the data stream. This thesis introduces a time decay factor to reduce the influence of historical data on the clusters. It also improves the cluster center and clustering distance calculations and optimizes the pyramidal time frame used for storage. During offline macro-clustering, CluStream requires the user to supply the number of clusters for merging micro-clusters, which can result in low-quality clusters. This thesis uses the Canopy algorithm to determine the number of clusters and the initial cluster centers, and applies Canopy-KMeans to optimize the offline merging of micro-clusters. Together, these improvements form FCPCluStream.

2) To improve efficiency, this thesis designs a parallel version of FCPCluStream based on the Spark Streaming model. An overall parallel architecture for FCPCluStream is designed according to the characteristics of the Spark Streaming model. In the online clustering stage, a map process is used for micro-cluster initialization and real-time micro-cluster updating. In the offline macro-clustering stage, map, combine, and reduce processes are used for merging micro-clusters based on Canopy-KMeans.

3) This thesis studies D-Stream in light of the characteristics of Spark Streaming and proposes an improved grid partitioning method to raise its execution efficiency. When D-Stream is parallelized, uniform partitioning of the grid space leads to load imbalance, so this thesis proposes a corresponding algorithm to address it. A map process is used for online grid mapping and offline cluster adjustment, and a global cluster merging method is then designed.

4) An experimental scheme is designed to test the performance of the clustering algorithms on the Spark Streaming platform. A Spark + YARN platform was built, and the KDD CUP 1999 network intrusion data set was used. The experiments analyze the clustering quality of the algorithms and measure their speedup on the cluster. The results verify the feasibility and effectiveness of the data stream clustering algorithms on Spark Streaming.
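Two ideas from item 1) can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's FCPCluStream implementation: the abstract does not give the form of the decay factor, so the exponential decay 2^(-rate * dt) used here is an assumption, and the names DecayedMicroCluster, decay_rate, canopy_centers, t1, and t2 are hypothetical. The first part shows how a decay factor can reduce the influence of older points on a micro-cluster; the second shows the standard Canopy procedure, whose canopy count can serve as the cluster number and whose centers can seed K-Means.

```python
import math


class DecayedMicroCluster:
    """Micro-cluster whose statistics fade with age.

    Assumption: exponential decay 2 ** (-decay_rate * dt); the thesis
    abstract does not specify the actual decay function.
    """

    def __init__(self, dim, decay_rate):
        self.weight = 0.0               # decayed point count
        self.linear_sum = [0.0] * dim   # decayed per-dimension sum
        self.decay_rate = decay_rate    # hypothetical tuning parameter
        self.last_time = None

    def insert(self, point, timestamp):
        # Fade the existing statistics before absorbing the new point.
        # Timestamps are assumed to be non-decreasing.
        if self.last_time is not None:
            factor = 2.0 ** (-self.decay_rate * (timestamp - self.last_time))
            self.weight *= factor
            self.linear_sum = [v * factor for v in self.linear_sum]
        self.weight += 1.0
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.last_time = timestamp

    def center(self):
        # Only meaningful after at least one insert.
        return [s / self.weight for s in self.linear_sum]


def canopy_centers(points, t1, t2):
    """Standard Canopy clustering with loose threshold t1 > tight threshold t2.

    Returns a list of (center, members); the list length can be used as k
    and the centers as initial centers for K-Means.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        kept = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)   # loosely belongs to this canopy
            if d >= t2:
                kept.append(p)      # still eligible to start or join another canopy
        remaining = kept
        canopies.append((center, members))
    return canopies
```

In this sketch, the micro-cluster center drifts toward recent points because older contributions are multiplied by a factor below one at every update. The Canopy thresholds t1 and t2 still have to be chosen by the user, so the procedure replaces the choice of k with the choice of two distance thresholds.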

Keywords/Search Tags:

Data Stream, Spark Streaming, CluStream Algorithm, D-Stream Algorithm, Parallel Processing

Related items

1	Design and Implementation Of Data Stream Clustering Algorithm StreamCKS Based On Spark Streaming
2	Research And Application Of RDF Stream Reasoning Based On Spark
3	Research On Parallel Clustering Algorithm For Streaming Data
4	WSN-oriented Stream Data Clustering Algorithm Research
5	The Research And Implementation Of Data Stream Processing And Analysis Engine Based On DAG
6	Research On GPU Based Data Stream Parallel Processing
7	Research And Application Of Stream Processing For Railway Operation And Maintenance
8	Design And Implementation Of Real-time Streaming Module Based On Spark Streaming
9	Study On Data Stream Techniques And Its Application In Electric Power Information Processing
10	Research On Processing Methods Of Data Stream Based On Parallel Computing