Design and Implementation Of Data Stream Clustering Algorithm StreamCKS Based On Spark Streaming

Posted on:2018-08-11

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Zhang

Full Text:PDF

GTID:2348330536972647

Subject:Engineering / Computer Technology

Abstract/Summary:

The need for real-time processing and analysis of massive, sustained, high-speed data streams is growing rapidly, and centralized stream analysis and processing techniques struggle to meet it. At the same time, with the rise of big data, extending traditional stream processing technology to distributed computing environments has become an active research direction. This thesis starts from data stream clustering and implements a typical data stream clustering algorithm on Spark Streaming, aiming to improve the efficiency of stream clustering by exploiting the architecture of the Spark framework itself. The main contents are as follows:

1) Design of a distributed stream clustering framework based on Spark Streaming: Building on the classic two-layer (online/offline) stream clustering framework, the Spark Streaming module is added to the online phase to process the data stream and micro-cluster the data; in the offline phase, the Spark batch module is mainly used for parallel clustering.

2) Implementation of the StreamCKS online data stream clustering algorithm based on the SSBuf tree: Taking into account the characteristics of data streams and of the Spark Streaming platform, the StreamCKS algorithm is proposed on the basis of existing data stream clustering algorithms (CluStream, StreamKM++, etc.). The SSBuf tree is designed for the online module of StreamCKS so that it can absorb high-speed bursts in the data stream through pre-aggregation and a caching mechanism while maintaining summary information about the data stream.

3) Implementation of the StreamCKS offline data stream clustering algorithm based on Canopy and K-Means: The Canopy algorithm performs a coarse clustering to initialize the value of k and the initial center points for K-Means, which reduces the number of K-Means iterations and improves the accuracy and stability of the clustering results.

4) Optimization of the StreamCKS algorithm for the Spark Streaming platform: Based on the characteristics of Spark Streaming, system configuration such as data serialization and cache size is tuned, which further improves the parallel efficiency and scalability of StreamCKS.

Finally, StreamCKS is tested on a real data set. The results show that, compared with the classical CluStream and StreamKM++ algorithms, StreamCKS maintains more clusters, which indicates that it can keep up with high-speed data streams; StreamCKS achieves better precision when the number of cluster centers is small; and as the number of nodes increases, StreamCKS shows clear advantages on high-dimensional data sets, with higher speedup and throughput.
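
The idea of using Canopy to initialize K-Means in the offline phase can be illustrated with a minimal single-machine sketch. It is not the thesis's Spark implementation: the function names (`canopy_init`, `kmeans`), the thresholds `t1` and `t2`, and the choice of the canopy mean as the initial center are all assumptions made for illustration only.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def canopy_init(points, t1, t2):
    """Coarse Canopy pass (assumed thresholds t1 > t2).

    Returns one initial center per canopy; the number of canopies
    serves as k for the K-Means step that follows.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be larger than t2 (tight)")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining[0]
        # Every point within the loose threshold t1 belongs to this canopy
        # (canopies may overlap, so membership is checked against all points).
        members = [p for p in points if dist(p, seed) <= t1]
        centers.append(mean(members))
        # Points within the tight threshold t2 can no longer seed a canopy.
        remaining = [p for p in remaining if dist(p, seed) > t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations starting from the supplied centers."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a cluster received no points.
        updated = [mean(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    initial = canopy_init(data, t1=5.0, t2=3.0)
    print(len(initial), kmeans(data, initial))
```

Because the Canopy pass already places the initial centers near the dense regions, the subsequent Lloyd loop tends to converge in few iterations; on well-separated data such as the toy input above it stops after the first pass.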

Keywords/Search Tags:

Data Stream Mining, Clustering algorithm, Distributed Computing, Spark Streaming, K-Means

Related items

1	Research On Data Stream Clustering Algorithm Based On Spark Streaming
2	Research On Fast Search Density Peak Clustering Algorithm Based On Streaming Computing
3	Analysis Of The Clustering Algorithm On Data Stream Using Resilient Distributed Datasets
4	Research On Data Stream Clustering Method Based On Spark
5	Optimization Of K-means Clustering Algorithm And Its Implementation On Spark Streaming
6	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
7	Research And Implementation Of Sequential Pattern Mining Algorithm Over Data Streams Based On Spark Streaming
8	Research On Data Mining Technology Based On Spark
9	Research And Realization Of Clustering Algorithm Based On Spark Platform
10	The Parallelization And Optimization Of K-means Algorithm Based On Spark