Research On Dynamic Clustering Algorithm Based On Spark Framework

Posted on:2018-11-16

Degree:Master

Type:Thesis

Country:China

Candidate:B T Zhang

GTID:2518305963492634

Subject:Electronics and Communications Engineering

Abstract/Summary:

In the information age, real-time streaming data appears more and more often in data collection scenarios, and traditional data analysis methods cannot meet the demands of real-time processing. In recent years many useful algorithms have been proposed for data stream mining, and clustering, an important branch of data mining, has also made breakthroughs in its application to data streams. Traditional clustering methods process existing data in batches and focus mainly on clustering accuracy. Because a data stream is both massive and real-time, a stream clustering algorithm has to consider more factors: it must scan the stream quickly, return clustering results in real time, and identify outliers and noise points promptly and accurately. D-Stream, a data stream clustering algorithm based on grid density, adopts the classical two-layer online/offline framework, which preserves efficiency and accuracy and can also find clusters of arbitrary shape.

With advances in data collection technology, the volume of data to be processed keeps growing, so parallel clustering algorithms for data streams are needed. An important question is how to parallelize a traditional data stream clustering algorithm so that it runs more efficiently without losing precision. At the same time, distributed stream processing frameworks such as Storm, Spark and Samza have emerged and quickly been adopted at large scale; their open-source communities are very active, and they make distributed stream processing easier to use. This thesis parallelizes the serial data stream clustering algorithm D-Stream: building on the original algorithm, it introduces the concept of a block and implements the algorithm on the big data processing framework Spark. To further improve time efficiency, it introduces an optimization based on concatenation, which preserves clustering accuracy while speeding up the generation of clustering results. Experimental results show that the resulting algorithm, PDStream, can be applied in a distributed environment, is efficient and scales well, and achieves dynamic clustering of streaming data under a distributed architecture.
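The grid-density idea behind D-Stream's online layer can be illustrated with a small serial sketch. This is not the thesis's PDStream implementation and does not use Spark. The decay factor, the cell width, and the names `cell_of`, `GridDensity` and `dense_cells` are all assumptions made for illustration, and the block concept is left out because the abstract does not describe it. Each incoming point is mapped to a grid cell, the cell's density is decayed by the time elapsed since its last update and then incremented, and the cells above a threshold are what an offline layer would go on to group into clusters.

```python
from collections import defaultdict
from math import floor

DECAY = 0.998      # assumed decay factor; the abstract gives no value
CELL_WIDTH = 1.0   # assumed side length of a grid cell


def cell_of(point, width=CELL_WIDTH):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(floor(x / width) for x in point)


class GridDensity:
    """Online layer sketch: one decayed density value per non-empty cell."""

    def __init__(self, decay=DECAY):
        self.decay = decay
        self.density = defaultdict(float)   # cell -> density at its last update
        self.last_seen = {}                 # cell -> time of its last update

    def add(self, point, t):
        """Absorb one point that arrives at time t."""
        cell = cell_of(point)
        elapsed = t - self.last_seen.get(cell, t)
        # Decay the stored density for the elapsed time, then count the new point.
        self.density[cell] = self.density[cell] * self.decay ** elapsed + 1.0
        self.last_seen[cell] = t

    def dense_cells(self, t, threshold):
        """Cells whose density, decayed to time t, is at least the threshold."""
        return {
            c for c, d in self.density.items()
            if d * self.decay ** (t - self.last_seen[c]) >= threshold
        }
```

Because each cell keeps only a density value and a timestamp, the stream is scanned once and the raw points never need to be stored. That property is what makes a grid-based summary a plausible candidate for partitioning across workers.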

Keywords/Search Tags:

D-Stream, PDStream, Spark, Dynamic clustering

Related items

1	Research On Incremental Clustering Algorithm Based On PDStream
2	Research On Data Stream Clustering Method Based On Spark
3	Design And Implementation Of Data Stream Clustering Algorithm StreamCKS Based On Spark Streaming
4	Research On Data Stream Clustering Algorithm Based On Spark Streaming
5	Analysis Of The Clustering Algorithm On Data Stream Using Resilient Distributed Datasets
6	Research And Application Of RDF Stream Reasoning Based On Spark
7	Design And Implementation Of Real-time Streaming Module Based On Spark Streaming
8	Research On Dynamic Measurement Based Data Stream Clustering And Its Applications
9	Research Of The Clustering Algorithm Based On The Spark
10	Research On Algorithm And Application Of Dynamic Grid-based Clustering Over Data Stream