Design and Implementation of Data Stream Clustering Algorithm StreamCKS Based on Spark Streaming

Posted on: 2018-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Zhang
GTID: 2348330536972647
Subject: Engineering / Computer Technology

Abstract/Summary:
The need for real-time processing and analysis of massive, sustained, high-speed data streams is growing rapidly, and centralized stream analysis and processing techniques struggle to meet it. At the same time, with the rise of big data, extending traditional stream processing technology to distributed computing environments has become an active research direction. This thesis starts from data stream clustering and implements a typical data stream clustering algorithm on Spark Streaming, aiming to improve the efficiency of stream clustering by exploiting the architecture of the Spark framework itself. The main contents of this thesis are as follows:

1) Design of a distributed stream clustering framework based on Spark Streaming: Building on the classic two-phase (online/offline) stream clustering framework, the Spark Streaming module is added to the online phase to ingest the data stream and micro-cluster the data; in the offline phase, the Spark batch module is mainly used for parallel clustering.

2) Implementation of the StreamCKS online clustering algorithm based on the SSBuf tree: Taking into account the characteristics of data streams and of the Spark Streaming platform, the StreamCKS algorithm is proposed on the basis of existing data stream clustering algorithms (CluStream, StreamKM++, etc.). The SSBuf tree is designed for the online module of StreamCKS so that, through pre-aggregation and a caching mechanism, it can absorb high-speed bursts in the data stream while maintaining summary information about the stream.

3) Implementation of the StreamCKS offline clustering algorithm based on Canopy and K-Means: The Canopy algorithm performs a coarse clustering that supplies the value of k and the initial center points for K-Means, which reduces the number of K-Means iterations and improves the accuracy and stability of the clustering results.

4) Optimization of the StreamCKS algorithm for the Spark Streaming platform: Based on the platform features of Spark Streaming, system configuration such as data serialization and cache size is tuned, which further improves the parallel efficiency and scalability of the StreamCKS algorithm.

Finally, StreamCKS is tested on a real data set. The results show that, compared with the classical CluStream and StreamKM++ algorithms, StreamCKS maintains more clusters, which indicates that it can keep up with high-speed data streams; StreamCKS achieves better precision when the number of cluster centers is small; and as the number of nodes increases, StreamCKS shows a clear advantage on high-dimensional data sets, with higher speedup and throughput.
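The Canopy-then-K-Means idea described for the offline phase can be illustrated with a minimal single-machine sketch. This is not the thesis's Spark implementation: the function names, the thresholds `t1` and `t2`, the use of the canopy mean as the initial center, and the toy data are all assumptions made for illustration only.

```python
import math

def mean(pts):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def canopy_centers(points, t1, t2):
    """Coarse Canopy pass: returns one center per canopy (t1 > t2).

    Assumption: each canopy is represented by the mean of its members.
    """
    assert t1 > t2, "loose threshold t1 must exceed tight threshold t2"
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Points within the loose threshold t1 form the canopy.
        members = [p for p in remaining if math.dist(seed, p) <= t1]
        centers.append(mean(members))
        # Points within the tight threshold t2 can no longer seed a canopy.
        remaining = [p for p in remaining if math.dist(seed, p) > t2]
    return centers

def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations from the given initial centers.

    Returns the final centers and the number of iterations performed.
    """
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iter + 1):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            return centers, iteration
        centers = new_centers
    return centers, max_iter

# Toy data (hypothetical): two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
seeds = canopy_centers(points, t1=3.0, t2=1.5)  # k is len(seeds)
centers, iterations = kmeans(points, seeds)
```

Here the number of canopies determines k, so K-Means needs neither a user-supplied k nor random initial centers; when the canopy centers already lie close to the final centroids, few Lloyd iterations are needed. The thresholds still have to be chosen for the data at hand.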
Keywords/Search Tags: Data Stream Mining, Clustering Algorithm, Distributed Computing, Spark Streaming, K-Means