Research On Supervised Learning Based Anomaly Detection For Multi-dimensional Sequence For Data Stream

Posted on: 2017-09-14
Degree: Master
Type: Thesis
Country: China
Candidate: H Bao
GTID: 2428330569498742
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, anomaly detection over dynamic, high-speed data streams has become a research hotspot. In particular, anomaly detection for multi-dimensional sequences over data streams is urgently needed in fields such as Web security, aerospace safety, and computer system anomaly diagnosis. Existing research on anomaly detection over data streams focuses mostly on single-dimensional sequences, while research on multi-dimensional sequences focuses mainly on pattern mining in static databases. Anomaly detection for multi-dimensional sequences over data streams is more challenging because three factors adversely affect the accuracy and efficiency of detection. First, multi-dimensional sequence data carries a large amount of information and is therefore much more complex to process. Second, data streams exhibit concept drift. Third, normal and abnormal data are always imbalanced. To improve the accuracy and efficiency of anomaly detection for multi-dimensional sequences over data streams, this thesis studies in depth multi-dimensional sequence processing, feature vector reduction over data streams, and anomaly detection over data streams. The main contributions are as follows.

The processing of multi-dimensional sequence data is crucial for anomaly detection over data streams. On the one hand, losing valid information during processing reduces detection accuracy. On the other hand, retaining too much invalid information keeps detection efficiency from meeting the requirements of a high-speed data stream. This thesis therefore first proposes a mixed multi-dimensional sequence transformation algorithm, MMST. MMST converts multi-dimensional sequences into fixed-length feature vectors and treats unordered and ordered dimensions differently during the transformation: it preserves the frequency information of words in unordered dimensions, and both the frequency and the order information of words in ordered dimensions. Theoretical analysis and experimental results show that, compared with the co-occurrence-matrix-based transformation algorithm CO-OC, MMST effectively reduces the length of the feature vectors and thus improves the efficiency of anomaly detection. Compared with the word-frequency-based transformation algorithm FRE, MMST preserves the valid information in the multi-dimensional sequence and effectively increases the accuracy of anomaly detection.

Appropriate processing techniques can transform multi-dimensional sequences into fixed-length feature vectors, but the resulting vectors are usually very sparse, which impairs the efficiency of anomaly detection. This thesis therefore proposes an incremental feature selection algorithm, IFS. IFS eliminates feature dimensions with low classification effectiveness, judged by the information amount and difference degree of each feature dimension in the vector, and so reduces the dimension of the feature vectors and improves detection efficiency. Because a data stream is dynamic, the classification effectiveness of each feature dimension changes with concept drift. IFS therefore evaluates the classification effectiveness of each feature dimension incrementally and adjusts the feature mapping function dynamically when concept drift occurs. Theoretical analysis and experimental results show that IFS effectively reduces the dimension of the feature vectors and greatly reduces the average update time of the anomaly detection system, and therefore improves detection efficiency. Compared with the system without IFS, the throughput of the anomaly detection system with IFS improves by 42%, while the accuracy of anomaly detection is not significantly affected.

After the multi-dimensional sequence transformation and dimension reduction, this thesis proposes CBAD, an anomaly detection algorithm over data streams based on a cost-sensitive support vector machine. CBAD adaptively sets the penalty factors of the cost-sensitive support vector machine according to the numbers of normal and abnormal data in the training set, and detects anomalies with this machine to improve detection accuracy over imbalanced data streams. To address the scarcity of labels in data streams, CBAD picks out the test data that carry a large amount of information for manual labeling, and mixes them with selected data from the old training set to update the cost-sensitive support vector machine. In this way, the label request rate is reduced and the accuracy of anomaly detection improves gradually. Moreover, CBAD detects and handles concept drift in the data stream in a label-independent manner, which safeguards the accuracy of anomaly detection. Experimental results show that, with a label request rate of only 30%, CBAD detects anomalies accurately and efficiently over data streams with concept drift.

To further evaluate the theoretical contributions of this thesis, ADMS, a cost-sensitive support vector machine based anomaly detection system for multi-dimensional sequences over data streams, is proposed. ADMS is implemented on the distributed stream processing platform Storm and detects multi-dimensional sequence anomalies in a distributed stream processing environment. ADMS first transforms the multi-dimensional sequence stream into a fixed-length feature vector stream with MMST and uses IFS to reduce the dimension of the feature vectors. It then detects abnormal multi-dimensional sequences by monitoring the feature vector stream with CBAD. Experiments show that ADMS detects abnormal multi-dimensional sequences efficiently and accurately. With a throughput of 199 sequences per second and a label request rate of only 30%, the false negative rate (FNR) of ADMS is lower than 5% and the false positive rate (FPR) is lower than 7%. In addition, ADMS also achieves high detection accuracy over data streams with concept drift.
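The abstract does not say how MMST encodes order information, so the following is only an illustrative sketch of the general idea of treating unordered and ordered dimensions differently, not the thesis's algorithm. It assumes a fixed vocabulary per dimension, stores word counts for every dimension, and, as a stand-in for order information, adds each word's mean normalized position in ordered dimensions. The function name `transform`, the dimension names, and the position encoding are all hypothetical.

```python
from collections import Counter


def transform(sequence, vocab, ordered_dims):
    """Map one multi-dimensional sequence to a fixed-length feature vector.

    sequence:     dict mapping dimension name -> list of words
    vocab:        dict mapping dimension name -> list of known words
    ordered_dims: set of dimension names whose word order matters

    Assumption (not from the thesis): order is encoded as the mean
    normalized position of each word, 0.0 meaning the word is absent.
    """
    vector = []
    for dim in sorted(vocab):
        words = sequence.get(dim, [])
        counts = Counter(words)
        # Frequency information is kept for every dimension.
        vector.extend(float(counts[w]) for w in vocab[dim])
        if dim in ordered_dims:
            # Order information is kept only for ordered dimensions.
            for w in vocab[dim]:
                positions = [i for i, x in enumerate(words) if x == w]
                if positions:
                    mean_index = sum(positions) / len(positions)
                    vector.append((mean_index + 1) / len(words))
                else:
                    vector.append(0.0)
    return vector


# Hypothetical example: HTTP-like records with an unordered and an ordered dimension.
vocab = {"headers": ["accept", "cookie", "host"], "path": ["api", "login", "v1"]}
ordered_dims = {"path"}

a = transform({"headers": ["host", "cookie", "host"], "path": ["api", "v1", "login"]},
              vocab, ordered_dims)
b = transform({"headers": ["cookie", "host", "host"], "path": ["login", "v1", "api"]},
              vocab, ordered_dims)
```

In this sketch the vector length grows linearly with the vocabulary size (one slot per word, plus one more per word in ordered dimensions), whereas a co-occurrence matrix needs one slot per word pair. Reordering the unordered `headers` dimension leaves the vector unchanged, while reordering `path` changes it.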
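As a rough illustration of setting penalty factors from the class counts in the training set, the sketch below uses an inverse-frequency heuristic: each class's penalty is scaled by `n_total / (2 * n_class)`. This rule is an assumption chosen for illustration (it matches the common "balanced" class-weight heuristic), and the abstract does not state the formula CBAD actually uses. The helper `error_rates` computes FNR and FPR by their standard definitions. The function names and the label convention (1 = abnormal, 0 = normal) are hypothetical.

```python
def penalty_factors(labels, base_c=1.0):
    """Return per-class SVM penalty factors from class counts.

    Assumed rule (not taken from the thesis): C_class = base_c * n_total / (2 * n_class),
    so the rarer class receives the larger misclassification penalty.
    Labels: 1 = abnormal, 0 = normal.
    """
    n_total = len(labels)
    n_abnormal = sum(1 for y in labels if y == 1)
    n_normal = n_total - n_abnormal
    if n_normal == 0 or n_abnormal == 0:
        raise ValueError("training set must contain both classes")
    return {
        0: base_c * n_total / (2 * n_normal),
        1: base_c * n_total / (2 * n_abnormal),
    }


def error_rates(y_true, y_pred):
    """Return (FNR, FPR), treating label 1 (abnormal) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


# Hypothetical imbalanced training set: 95 normal, 5 abnormal samples.
weights = penalty_factors([0] * 95 + [1] * 5)

# Hypothetical predictions on five test samples.
fnr, fpr = error_rates([1, 1, 0, 0, 0], [1, 0, 0, 1, 0])
```

With 95 normal and 5 abnormal samples, the abnormal class gets a penalty 19 times larger than the normal class, so missing an anomaly costs the classifier more than raising a false alarm. Recomputing the factors whenever the training set is updated is one simple way to make them adapt to the stream.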
Keywords/Search Tags: Data Stream, Anomaly Detection, Multi-dimensional Sequence, Concept Drift, Feature Selection, Active Incremental Learning