Research On Online Streaming Data Clustering Algorithm Based On Natural Neighbor

Posted on: 2020-06-23  Degree: Master  Type: Thesis
Country: China  Candidate: S F Ma  Full Text: PDF
GTID: 2428330599953704  Subject: Engineering
Abstract/Summary:
Data is becoming ever more important in the era of big data, and advances in real-time data acquisition mean that data streams are now collected in many application fields. Large volumes of streaming data are generated in weather forecasting, e-commerce, network security, and video surveillance, so research on data streams and related technologies is critical in these areas. Compared with static data, data streams are time-ordered, unbounded in volume, volatile, and low in value density. These characteristics pose new challenges for cluster mining on data streams: the algorithm must process an unbounded data set with limited memory, it must process data quickly, and it must be highly adaptable, both to the ever-evolving underlying model of the data stream and to clusters of arbitrary shape.

Because the total volume of a data stream is unbounded, data stream clustering can only mine a finite, continuously updated subset of the data. Data stream algorithms generally use a sliding window, a landmark window, or a decay window to select the data to be mined, and use a summary data structure to maintain statistical information about the data so that mining can proceed smoothly. Data stream clustering algorithms can also be divided into four families: density-based, grid-based, partition-based, and hierarchical algorithms. These algorithms have many parameters whose values are difficult to determine. Two-stage clustering algorithms, represented by CluStream, cannot produce clustering results in real time. The later CEDAS algorithm solves this problem, but it cannot automatically determine the micro-cluster threshold or the search radius.

To address these problems, this thesis introduces the natural neighbor algorithm. Unlike the k-nearest neighbor algorithm, the natural neighbor algorithm requires no manually supplied parameters. It adaptively iterates to obtain the natural eigenvalue of the data set and takes the distribution of the data into account: points in densely distributed regions have many neighbors, and points in sparse regions have few. Through a large number of experiments, this thesis derives formulas that determine the density threshold and the neighborhood radius from the natural eigenvalue, and it weights the search radius of each micro-cluster center according to the observed natural distribution of the data set. Introducing the natural neighbor algorithm into CEDAS yields the proposed NaN-CEDAS algorithm.

The effectiveness of NaN-CEDAS is verified on both artificial and real data sets. First, several commonly used clustering data sets are used to check the threshold and neighborhood radius obtained by the natural neighbor algorithm; the experiments show that the algorithm clusters these data sets correctly with the derived values. Next, experiments on two artificial data streams show that the algorithm merges and separates micro-clusters well and discovers new micro-clusters quickly. Finally, two real data sets, the KDDCUP 99 network attack data set and the Intel Berkeley Research Laboratory sensor data stream, are used to evaluate the proposed algorithm in realistic scenarios. The experimental results show that the algorithm performs well in comparison with the CEDAS, DenStream, and CluStream algorithms.
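The parameter-free neighbor search mentioned in the abstract can be illustrated with a minimal sketch. It follows a commonly published formulation of natural neighbor search, not necessarily the exact variant used in the thesis: the neighborhood size r grows by one per round, each point "votes" for its r-th nearest neighbor, and the search stops when every point has received at least one vote or the number of unvoted points stops changing. The function name `natural_eigenvalue`, the brute-force distance sort, and the stopping rule are assumptions made for illustration; how NaN-CEDAS turns the resulting value into a density threshold and radius is not given in the abstract and is not shown here.

```python
import math


def natural_eigenvalue(points):
    """Return (r, reverse_counts) for a list of coordinate tuples.

    r is the neighborhood size at which the search stabilises (the
    "natural eigenvalue" in this sketch), and reverse_counts[i] is how
    many points list point i among their r nearest neighbors.
    Hypothetical helper; brute force, O(n^2 log n).
    """
    n = len(points)
    # For every point, the indices of all other points sorted by distance.
    order = []
    for i, p in enumerate(points):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        order.append(others)

    reverse_counts = [0] * n
    prev_orphans = None
    for r in range(1, n):
        # Each point votes for its r-th nearest neighbor.
        for i in range(n):
            reverse_counts[order[i][r - 1]] += 1
        # Orphans are points that nobody has voted for yet.
        orphans = sum(1 for c in reverse_counts if c == 0)
        # Assumed stopping rule: no orphans, or orphan count unchanged.
        if orphans == 0 or orphans == prev_orphans:
            return r, reverse_counts
        prev_orphans = orphans
    return n - 1, reverse_counts


if __name__ == "__main__":
    # Two small, well-separated square clusters (made-up data).
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    r, counts = natural_eigenvalue(data)
    print("natural eigenvalue:", r)
    print("reverse neighbor counts:", counts)
```

In this formulation, points in dense regions collect many reverse-neighbor votes and points in sparse regions collect few, which matches the behavior the abstract attributes to natural neighbors.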
Keywords/Search Tags: Stream data, clustering, natural neighbor algorithm, NaN-CEDAS