
Research on Large-Scale Clustering and Its Applications to Anomaly Detection

Posted on: 2018-10-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M Ye
Full Text: PDF
GTID: 1318330563951154
Subject: Military cryptography
Abstract/Summary:
With the increasingly severe network security situation, more and more security devices and data collection technologies have come into use, creating an urgent demand for analysis of the resulting data to improve network security management and protection. Because it is difficult to obtain enough labeled training data, unsupervised analysis methods are better suited to large-scale network data. Clustering, a typical unsupervised analysis method that partitions a dataset into subsets based on similarity, can reveal the internal distribution structure of the data and can also serve, through that structure, as a component of other data analysis methods. In particular, clustering-based anomaly detection is not only highly efficient but can also detect unknown attacks without prior knowledge, and has therefore become a complement to traditional rule-based intrusion detection systems. However, the large volume and high dimensionality of network data in the big data era confront traditional clustering algorithms with prohibitive space and time complexity, so making these analysis methods scalable has become imperative. Random sketching, which comprises random sampling and random projection, is a mainstream scalability technique for large-scale data analytics and has become a hot research topic: it offers the theoretical advantage of maintaining analysis quality at reduced complexity, and it is convenient to implement and parallelize.

Aiming at the design of clustering algorithms based on random sketching, as well as at the sensitivity to the number of clusters and the adaptation to streaming data encountered in clustering-based anomaly detection, we carry out the following research:

(1) Fuzzy c-means (FCM) clustering based on random projection. FCM clustering, which allows flexible (soft) subset partitioning, is suitable for network data whose categories are fuzzy. However, for high-dimensional data the complexity of the underlying optimization problem increases remarkably; in addition, initialization strongly affects the result and can make the output unstable. Addressing these two problems, we study FCM clustering with random projection and FCM ensemble clustering:
· We theoretically analyze the influence of random-projection dimensionality reduction on FCM clustering. We prove that random projection preserves the overall variability of the dataset and is very effective for dimensionality reduction. Combining this with the approximate preservation of pairwise distances by random projection, we design an improved FCM clustering algorithm based on random-projection dimensionality reduction. The performance of the new algorithm is verified experimentally.
· To avoid the influence of initialization on clustering results, we design an ensemble clustering algorithm that integrates multiple runs of the improved FCM algorithm. Via the singular value decomposition of the concatenated membership matrices, the ensemble algorithm obtains a spectral embedding of the dataset in linear time and space. Experiments show that the new algorithm has advantages in clustering accuracy, efficiency, and robustness.

(2) Spectral clustering based on landmark-point sampling. Spectral clustering accommodates a wide range of cluster geometries, which makes it applicable when the structure of network data is unknown. By constructing an affinity matrix between landmark points and data points, the Landmark-based Spectral Clustering (LSC) algorithm significantly reduces the computational complexity of the spectral embedding. For big data problems, however, the existing landmark-generation strategies have deficiencies: random sampling yields unstable results, while the k-means-centers method has unknown convergence time and requires repeated passes over the data. We design a fast landmark-sampling spectral clustering algorithm based on approximate singular value decomposition, in which the sampling probability of each landmark point is determined by the row norms of the approximate singular vector matrix. Compared with LSC based on uniform sampling, the new algorithm is more stable and accurate; compared with LSC based on k-means centers, it reduces the computational complexity. We also analyze theoretically how well the sampled landmarks preserve the information of the original data, and we verify the approach experimentally on public datasets.

(3) Constrained spectral clustering with random projection and sampling. Since unlabeled instances usually far outnumber labeled ones in network datasets, it is desirable to improve clustering quality by exploiting the labeled instances. Moreover, ensemble clustering both improves clustering quality and adapts to multi-source, heterogeneous data. We therefore study constrained spectral clustering with matrix reconstruction and constrained spectral ensemble clustering with random projection:
· Combining adjacency-matrix reconstruction with the constrained spectral clustering model, we design a constrained spectral clustering framework with linear complexity. Through theoretical analysis and experiments, we show that the new framework not only compresses the scale of the original model but also retains the information that dominates the partition. By means of landmark-based similarity-graph construction and approximate matrix decomposition, the framework applies to both feature data and graph data. Experiments show advantages in clustering quality, efficiency, and practicality.
· We first improve the efficiency of the above framework through uniform random sampling in the k-means phase. Then, combined with sparse-random-projection dimensionality reduction in spectral ensemble clustering, we design a new constrained spectral ensemble clustering algorithm. We prove that the new algorithm approximately preserves the clustering result in terms of the objective function, and we verify its effectiveness and efficiency experimentally.

(4) Applications of the designed clustering algorithms to anomaly detection. Anomaly detection has become one of the main techniques for network intrusion detection. We investigate the performance of the designed clustering algorithms on anomaly detection and design two new anomaly detection algorithms, targeting the streaming-data environment and the avoidance of the cluster number's effect on detection results, respectively:
· Combining the designed clustering algorithms with existing methods for computing anomaly indices, we obtain several anomaly detection algorithms suited to different conditions, and we verify their performance on real-world network connection data.
· To avoid the influence of the cluster number on detection results, we combine the designed (constrained) spectral clustering with the principal component analysis (PCA) subspace method to design a new anomaly detection algorithm that is insensitive to the number of clusters. Its performance is likewise verified experimentally.
· Aiming at real-time analysis of network data, we design a streaming anomaly detection algorithm that exploits constraint information effectively through our constrained spectral clustering. Through the landmark points, the streaming algorithm propagates information about the data stream efficiently. Its effectiveness and efficiency on stream detection are verified experimentally.
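The distance-preservation property behind contribution (1) can be illustrated with a minimal sketch. All sizes, the Gaussian projection matrix, and the toy data below are illustrative assumptions, not the dissertation's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for high-dimensional network feature data:
# n points in d dimensions, reduced to k dimensions.
n, d, k = 200, 1000, 50
X = rng.normal(size=(n, d))

# Gaussian random projection with entries N(0, 1/k), so squared
# pairwise distances are preserved in expectation (Johnson-Lindenstrauss).
R = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))
Y = X @ R  # reduced data, n x k

def pdist2(A, i, j):
    """Squared Euclidean distance between rows i and j of A."""
    return float(np.sum((A[i] - A[j]) ** 2))

# Distortion of a few pairwise distances; each ratio concentrates near 1.
ratios = [pdist2(Y, i, j) / pdist2(X, i, j)
          for i, j in [(0, 1), (2, 3), (10, 20)]]
print(ratios)
```

Running a distance-based method such as FCM on `Y` instead of `X` is what reduces the per-iteration cost from O(nd) to O(nk) per point.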
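The landmark-sampling idea in contribution (2) amounts to weighting each point by the squared row norm of an approximate singular vector matrix. The sketch below uses one simple way to get such a matrix (a random sketch followed by QR); the dissertation's exact approximate SVD scheme may differ, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dataset: n points in d dimensions; sample m landmarks
# using a rank-r approximation.
n, d, m, r = 300, 20, 10, 5
X = rng.normal(size=(n, d))

# Approximate left singular subspace via a random sketch of X.
S = X @ rng.normal(size=(d, r))   # n x r sketch
Q, _ = np.linalg.qr(S)            # orthonormal basis for the sketch, n x r

# Sampling probability of each point proportional to the squared row
# norm of the (approximate) singular vector matrix, then normalized.
p = np.sum(Q ** 2, axis=1)
p = p / p.sum()

# Draw the landmark points without replacement under these weights.
landmarks = rng.choice(n, size=m, replace=False, p=p)
print(sorted(landmarks))
```

Rows with large norm contribute more to the dominant singular directions, so sampling them preferentially tends to preserve the structure of the affinity matrix better than uniform sampling.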
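The first bullet of contribution (4) pairs a clustering step with an anomaly index. A common such index, used here as a stand-in for the dissertation's methods, is the distance to the nearest cluster centroid; the synthetic two-cluster data, the plain k-means step, and the percentile threshold are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "normal" traffic as two dense clusters plus two outliers
# (purely illustrative; real network connection features would be used).
normal = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
                    rng.normal(5.0, 0.5, size=(100, 2))])
outliers = np.array([[10.0, 10.0], [-8.0, 6.0]])
X = np.vstack([normal, outliers])

# Plain k-means (Lloyd's iterations) as the clustering step,
# with fixed initial centroids for reproducibility.
k = 2
C = np.array([[0.0, 0.0], [5.0, 5.0]])
for _ in range(20):
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    C = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Anomaly index: distance to the nearest centroid; flag the points
# whose index exceeds a high percentile of all scores.
d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
score = np.sqrt(d2.min(axis=1))
threshold = np.percentile(score, 98)
flagged = np.where(score > threshold)[0]
print(flagged)
```

Points far from every cluster centroid (indices 200 and 201 here) receive the largest scores and are flagged without any labeled attack data.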
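The PCA subspace method mentioned in the second bullet of contribution (4) scores a sample by its residual energy outside the principal subspace (the squared prediction error). The sketch below assumes data lying near a low-dimensional subspace; dimensions, noise level, and the name `spe` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: n samples near a rank-r subspace of R^d.
n, d, r = 300, 10, 3
B = rng.normal(size=(d, r))
X = rng.normal(size=(n, r)) @ B.T + 0.05 * rng.normal(size=(n, d))
X = X - X.mean(axis=0)

# Principal subspace from the top-r right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:r].T  # d x r matrix of principal directions

def spe(x):
    """Squared prediction error: energy of x outside the principal subspace."""
    resid = x - P @ (P.T @ x)
    return float(resid @ resid)

# In-subspace samples score near zero; an arbitrary off-subspace
# vector scores high, so it is flagged as anomalous.
normal_scores = [spe(x) for x in X[:5]]
anomaly = 3.0 * rng.normal(size=d)
print(normal_scores, spe(anomaly))
```

Because the score depends only on the fitted subspace, not on any cluster assignment, it is one way to obtain detection results that are insensitive to the number of clusters.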
Keywords/Search Tags: network security, data analysis, large scale data, clustering, anomaly detection, sampling, random projection