
Research On Data Stream Clustering Method Based On Spark

Posted on: 2019-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Q Sun
Full Text: PDF
GTID: 2428330548486986
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the information society, data streams are generated everywhere: in web search, Internet of Things systems, and sensor networks, as well as in traditional sectors such as industry, medicine, finance, and transportation. Because the data arrive at high speed, in large volumes, and in complex formats, storing these streams is costly and analyzing them in real time is difficult. Clustering is an unsupervised learning task in data analysis: it groups the items of a collection by similarity, so that similarity within a cluster is as high as possible and similarity between different clusters is as low as possible. Because stream data are generated in real time, traditional clustering algorithms cannot analyze them in real time. How to perform real-time cluster analysis on data streams has therefore become a hot topic in data mining.

In recent years, the gradual maturing of distributed big data processing frameworks has brought new ideas for real-time, efficient, and stable analysis of stream data. Compared with the Hadoop platform, Spark has the advantage of memory-based computing: it caches intermediate results in memory across the iterations of an algorithm, which reduces the number of disk reads and shortens running time. With its high fault tolerance and high throughput, Spark has become one of the widely used computing frameworks for stream data clustering. Combining data stream clustering algorithms with a distributed in-memory computing framework, this thesis studies the following aspects:

(1) First, traditional clustering algorithms and data stream clustering algorithms are analyzed. The algorithms are classified according to their characteristics, and the advantages and disadvantages of each class are summarized. The principles and basic architecture of Hadoop and Storm are analyzed, along with the features of the Spark platform and its core modules.

(2) Second, to address the characteristics of data streams, the SClustream algorithm is proposed on the basis of the traditional Clustream algorithm. The online micro-clustering layer of SClustream handles the historical data problem by introducing a time decay function. The offline macro-clustering layer improves the K-Means algorithm with the idea of the SA algorithm, using SA to globally optimize the K-Means clustering results to a certain degree. With this optimization of the clustering results, the accuracy of the final SClustream algorithm is improved.

(3) Finally, based on an analysis of the Spark platform, the Clustream algorithm and the optimized SClustream algorithm are parallelized and run on Spark. A series of comparison experiments on the two algorithms is used to analyze and explain the operating efficiency and advantages of the optimized algorithm under the distributed in-memory computing framework. The results show that, compared with Clustream, the parallel SClustream algorithm improves clustering accuracy and speedup ratio to some extent.
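The two layers described in (2) can be illustrated with a minimal single-machine sketch in plain Python. It is not the thesis's implementation. The abstract gives neither the decay function nor the SA schedule, so everything concrete here is an assumption: the exponential fading form 2^(-lam * dt), the reading of SA as simulated annealing, the Gaussian perturbation of one center per step, the geometric cooling schedule, and all names (`DecayedMicroCluster`, `sa_refine`, `lam`, `cooling`). Spark parallelization is left out entirely.

```python
import math
import random


class DecayedMicroCluster:
    """Cluster feature vector whose statistics fade with age.

    Assumed decay form: every statistic is multiplied by 2^(-lam * dt),
    where dt is the time elapsed since the last update.
    """

    def __init__(self, point, t, lam=0.1):
        self.lam = lam
        self.n = 1.0                      # decayed point count (weight)
        self.ls = list(point)             # decayed linear sum per dimension
        self.ss = [x * x for x in point]  # decayed squared sum per dimension
        self.t = t                        # time of the last update

    def _decay_to(self, t):
        f = 2.0 ** (-self.lam * (t - self.t))
        self.n *= f
        self.ls = [v * f for v in self.ls]
        self.ss = [v * f for v in self.ss]
        self.t = t

    def insert(self, point, t):
        """Fade the old statistics to time t, then absorb the new point."""
        self._decay_to(t)
        self.n += 1.0
        self.ls = [a + x for a, x in zip(self.ls, point)]
        self.ss = [a + x * x for a, x in zip(self.ss, point)]

    def center(self):
        return [v / self.n for v in self.ls]


def cost(points, weights, centers):
    """Weighted sum of squared distances from each point to its nearest center."""
    return sum(
        w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p, w in zip(points, weights)
    )


def sa_refine(points, weights, centers, temp=1.0, cooling=0.95,
              steps=200, step_size=0.5, seed=0):
    """Refine K-Means centers with simulated annealing (assumed reading of SA).

    Each step perturbs one center, always accepts an improvement, and accepts
    a worse candidate with probability exp(-delta / temp).
    """
    rng = random.Random(seed)
    current = [list(c) for c in centers]
    current_cost = cost(points, weights, current)
    best, best_cost = [list(c) for c in current], current_cost
    for _ in range(steps):
        cand = [list(c) for c in current]
        i = rng.randrange(len(cand))
        cand[i] = [x + rng.gauss(0.0, step_size) for x in cand[i]]
        cand_cost = cost(points, weights, cand)
        delta = cand_cost - current_cost
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = [list(c) for c in current], current_cost
        temp *= cooling  # geometric cooling schedule (an assumption)
    return best, best_cost
```

For example, a micro-cluster created from the point (0, 0) at t=0 with `lam=1.0` that then absorbs (3, 3) at t=1 ends with weight 1.5 and center (2, 2): the older point counts for half as much as the new one. In the offline layer, the micro-cluster centers and their decayed weights would play the role of `points` and `weights`, with the K-Means output passed in as `centers`. Because `sa_refine` tracks the best state seen, the cost it returns is never worse than that of the initial centers.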
Keywords/Search Tags: Spark, Data stream, Cluster, Distributed computing