
Research And Implementation Of Stream Data Clustering Algorithm Based On Storm

Posted on: 2020-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: L L Wang
Full Text: PDF
GTID: 2428330575487990
Subject: Computer application technology
Abstract/Summary:
With the continuous development of the information society and the wide application of Internet technology, stream data, as one of the most important data models, has been widely used in fields such as network communication, aeronautical engineering, financial markets, and e-commerce. Cluster analysis is an effective data mining method that groups data according to the similarity principle in order to support data analysis. However, stream data is massive and arrives in real time, so traditional clustering algorithms cannot meet its processing requirements. Research on stream data clustering algorithms has therefore become increasingly important in the field of data mining.

Because stream data is unbounded, real-time, and volatile, the demands placed on stream data clustering algorithms are high. Such algorithms must not only process massive amounts of data in time, but also cluster the data accurately so that the accuracy of data analysis improves. Three problems stand out. First, how to process high-dimensional stream data effectively. Second, how to identify outliers accurately and eliminate their impact on the clustering result. Third, how to handle historical data in a timely manner and so improve clustering accuracy. In view of these problems, the main research contents of this thesis are as follows.

(1) When dealing with high-dimensional, massive stream data, clustering algorithms usually suffer from low clustering efficiency and poor real-time performance. This thesis proposes a dimensionality reduction algorithm called DP-OPCA, obtained by improving Principal Component Analysis (PCA). The DP-OPCA algorithm preprocesses the data with a mean-based method, improves the computation of the correlation coefficient matrix in PCA according to the Pearson method, and is implemented as a distributed, parallel scheme. Experimental results show that the DP-OPCA algorithm can effectively reduce the dimension of high-dimensional stream data.

(2) To improve the ability of stream data clustering algorithms to recognize outliers correctly, this thesis improves the CluStream algorithm and proposes the OD-CluStream algorithm. The OD-CluStream algorithm defines two new concepts, micro-cluster gravity and the temporal clustering feature vector of a micro-cluster, improves the radius formula of the micro-cluster, and uses the detection method called LDOF to identify abnormal data. In addition, the OD-CluStream algorithm sets up an abnormal-data buffer mechanism: each abnormal point is given a growth observation period of m time before the algorithm decides whether it is really an outlier, which makes the clustering more accurate.

(3) To process historical data in time and eliminate its unnecessary influence on the current clustering result, the OD-CluStream algorithm introduces an attenuation function and assigns a weight to each data point. Stale micro-clusters are removed according to changes in the temporal micro-cluster weights, which improves clustering quality.

(4) This thesis deploys the DP-OPCA algorithm on the stream data processing platform Storm to perform dimensionality reduction of stream data. It also implements a parallel scheme of the OD-CluStream algorithm on Storm and, based on this platform, verifies the feasibility and effectiveness of the OD-CluStream algorithm.
Keywords/Search Tags:Stream data, Clustering algorithm, Parallel, Storm
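
As a rough illustration of two ideas named in the abstract, the sketch below computes an LDOF score and applies a fading weight. It is not the thesis's OD-CluStream implementation. It assumes the commonly used definition of LDOF (mean distance from a point to its k nearest neighbours, divided by the mean pairwise distance among those neighbours), and it assumes an exponential attenuation function of the form 2^(-decay_rate * elapsed), which the abstract does not specify. The function names and the parameters `k` and `decay_rate` are hypothetical.

```python
import math
from itertools import combinations


def ldof(point, others, k):
    """LDOF of `point` with respect to the candidate neighbours in `others`.

    Assumed definition: (mean distance to the k nearest neighbours) divided by
    (mean pairwise distance among those k neighbours). Scores well above 1
    suggest the point lies outside its neighbourhood.
    """
    if k < 2 or len(others) < k:
        raise ValueError("need k >= 2 and at least k candidate neighbours")
    # k nearest neighbours of `point` by Euclidean distance
    neighbours = sorted(others, key=lambda q: math.dist(point, q))[:k]
    # mean distance from the point to its k nearest neighbours
    knn_dist = sum(math.dist(point, q) for q in neighbours) / k
    # mean pairwise distance among the neighbours themselves
    pairs = list(combinations(neighbours, 2))
    inner_dist = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    if inner_dist == 0:
        # all neighbours coincide; any distinct point is maximally deviant
        return math.inf if knn_dist > 0 else 0.0
    return knn_dist / inner_dist


def decayed_weight(weight, elapsed, decay_rate):
    """Assumed attenuation function: weight * 2^(-decay_rate * elapsed)."""
    return weight * 2.0 ** (-decay_rate * elapsed)


if __name__ == "__main__":
    cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    print(ldof((10.0, 10.0), cluster, k=3))  # far from the cluster, score above 1
    print(ldof((0.5, 0.5), cluster, k=3))    # inside the cluster, score below 1
    print(decayed_weight(1.0, elapsed=2, decay_rate=0.5))
```

In a streaming setting, a point with a high score would go into the abnormal buffer rather than being discarded at once, and a micro-cluster whose decayed weight falls below some threshold would be treated as stale; both thresholds are tuning choices not given in the abstract.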