
Research On Unsupervised Outlier Detection Approach For Multi-dimensional Sequence Over Data Stream

Posted on: 2017-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: D S Yang
Full Text: PDF
GTID: 2428330569998872
Subject: Computer technology
Abstract/Summary:
Sequence anomaly detection aims to discover sequence data that deviates from the normal pattern. In some scenarios, discovering anomalous data yields especially valuable information. In recent years, with the rapid development of computer technology, anomaly detection has shifted substantially, from single-dimensional sequence anomaly detection in offline settings to multi-dimensional sequence anomaly detection over data streams. It has become a research hotspot and is widely used in fields such as credit card fraud detection and intrusion detection in computer systems. However, current work on sequence anomaly detection concentrates mainly on single-dimensional sequence data and mainly targets offline data. Multi-dimensional sequence anomaly detection over data streams is more challenging than one-dimensional sequence anomaly detection over offline datasets. The main challenges are as follows: (1) the space complexity of multi-dimensional sequences grows exponentially with the number of dimensions, which makes data processing highly complex and affects the efficiency and accuracy of anomaly detection; (2) multi-dimensional sequence data in a data stream arrives continuously, so detection efficiency must be improved; (3) concept drift occurs in data streams, so the detection model must be adjusted dynamically to improve the detection rate of real anomalies.

To improve multi-dimensional sequence anomaly detection over data streams, this thesis focuses on three techniques: dimensionality reduction for multi-dimensional sequence data, efficient sequence anomaly detection, and dynamic adjustment of the detection model.

Dimensionality reduction for multi-dimensional sequences is intended to reduce the space complexity of sequence data and improve modeling capability. To guarantee the recognition rate, however, as much feature information as possible must be preserved while the space complexity is reduced. This thesis proposes a feature selection method based on mutual information and minimum spanning tree clustering (MIMS). The method uses mutual information to analyze the relationships between dimensions. The clustering step ensures low correlation between clusters and high correlation within each cluster, and a representative feature is then selected from each cluster. The method selects representative features effectively and reduces space complexity while preserving the useful information in the multi-dimensional sequence, thereby maintaining the anomaly recognition rate. Experimental results show that features selected by MIMS improve classification accuracy by 3.2% compared with FCBF and CFS.

Efficient detection techniques for sequence anomalies are intended to improve sequence processing efficiency. Traditional sequence anomaly detection methods mainly target offline data, and both modeling and detection are time-consuming. Existing sequence anomaly detection methods over data streams are mainly based on frequent item mining, which applies only to data with fixed patterns and has difficulty fully mining the relationships within a sequence. This thesis presents an outlier detection method based on a probabilistic suffix tree (PST) with random sampling and subsequence partitioning (RSOD). It reduces the complexity of the training data through random sampling and subsequence partitioning, and speeds up model construction with an index structure. The method effectively reduces model complexity and shortens modeling time. Because the model complexity is low, anomaly detection in the detection phase is also efficient. Experimental results show that RSOD reduces the model construction delay and the detection delay by 50% and 34%, respectively, compared with the traditional PST method, while keeping the anomaly recognition rate above 91%.

Dynamic model adjustment is intended to reduce the false positive rate by detecting concept drift, and to maintain outlier detection efficiency by adjusting the model dynamically. This thesis proposes an outlier buffer based dynamic model adjustment method (OBDMA). The method first detects concept drift in the data stream using drift detection based on statistics and on the detection rate. Once concept drift is detected, a model is reconstructed for the new data distribution. At the same time, a time decay function dynamically adjusts the weight of the model to maintain anomaly detection efficiency. Experimental results show that OBDMA improves the accuracy of concept drift detection by 26% compared with statistics-based concept drift detection.

To further validate these results, an outlier detection system for multi-dimensional sequences over data streams (ODSMS) is designed and implemented on the streaming platform Storm. The system uses MIMS to select features from multi-dimensional log audit data. After representative features are selected, RSOD builds the sequence model and detects anomalous data, while OBDMA detects concept drift and dynamically adjusts the model. By modularizing feature selection, sequence modeling and anomaly detection, the ODSMS design reduces interface complexity and improves the robustness of the system. Experimental results show that ODSMS processes Unix log audit data (BSM structure) smoothly and effectively, keeping the anomaly detection rate above 88% and the false positive rate below 7%.
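
The abstract does not give the details of MIMS, so the following is only a minimal sketch of the general idea it describes: measure pairwise mutual information between features, cluster them with a minimum spanning tree, and keep one representative per cluster. The function names, the normalization by the larger entropy, the use of Kruskal's algorithm stopped at `k` components (equivalent to cutting the `k - 1` longest edges of the tree), and the rule for picking the representative are all assumptions made for illustration, not the thesis's actual procedure.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(x):
    """Empirical entropy (in nats) of a discrete column."""
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in Counter(x).values())


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def select_features(columns, k):
    """Cluster features into k groups and return one feature name per group.

    columns: dict mapping feature name -> list of discrete values.
    k: number of clusters (hypothetical parameter for this sketch).
    """
    names = list(columns)

    # Similarity = mutual information normalized by the larger entropy.
    sim = {}
    for a, b in combinations(names, 2):
        h = max(entropy(columns[a]), entropy(columns[b]))
        s = mutual_information(columns[a], columns[b]) / h if h > 0 else 0.0
        sim[(a, b)] = sim[(b, a)] = s

    # Union-find structure used by Kruskal's algorithm.
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Add edges in order of increasing distance (1 - similarity) and stop
    # once k components remain.
    edges = sorted(combinations(names, 2), key=lambda e: 1.0 - sim[e])
    components = len(names)
    for a, b in edges:
        if components <= k:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)

    # Representative = member with the highest total similarity to the
    # other members of its cluster (an assumed selection rule).
    selected = [
        max(members, key=lambda m: sum(sim[(m, o)] for o in members if o != m))
        for members in clusters.values()
    ]
    return sorted(selected)
```

Features that carry the same information end up in the same cluster, so only one of them is kept. Continuous features would need to be discretized before this sketch could be applied.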
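
To illustrate how a suffix-based sequence model can score anomalies, the sketch below uses a bounded-depth context model as a simplified stand-in for a probabilistic suffix tree. It is not the RSOD method: random sampling, subsequence partitioning and the index structure mentioned in the abstract are omitted, and the class name `ContextModel`, the parameters `max_depth` and `smoothing`, and the scoring rule (average negative log-likelihood per symbol) are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


class ContextModel:
    """Next-symbol model conditioned on the longest known suffix context."""

    def __init__(self, max_depth=3, smoothing=1.0):
        self.max_depth = max_depth
        self.smoothing = smoothing
        # counts[context][symbol] = times symbol followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))
        self.alphabet = set()

    def fit(self, sequences):
        for seq in sequences:
            self.alphabet.update(seq)
            for i, sym in enumerate(seq):
                # Record the symbol under every suffix context of length
                # 0..max_depth that ends just before position i.
                for d in range(min(self.max_depth, i) + 1):
                    self.counts[tuple(seq[i - d:i])][sym] += 1
        return self

    def prob(self, context, sym):
        """Smoothed probability of sym after the longest seen suffix."""
        ctx = tuple(context)[-self.max_depth:] if self.max_depth > 0 else ()
        # Back off to shorter suffixes until one was seen in training.
        while ctx and ctx not in self.counts:
            ctx = ctx[1:]
        seen = self.counts.get(ctx, {})
        total = sum(seen.values())
        # Reserve one extra slot for symbols never seen in training.
        slots = len(self.alphabet) + 1
        return (seen.get(sym, 0) + self.smoothing) / (total + self.smoothing * slots)

    def score(self, seq):
        """Average negative log-likelihood per symbol; higher = more anomalous."""
        nll = 0.0
        for i, sym in enumerate(seq):
            nll -= math.log(self.prob(seq[:i], sym))
        return nll / max(len(seq), 1)
```

A sequence would be flagged as an outlier when its score exceeds a threshold chosen on normal data. Choosing that threshold, and how the model is rebuilt under concept drift, are outside the scope of this sketch.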
Keywords/Search Tags:data stream, multi-dimensional sequence, anomaly detection, feature selection, probability suffix tree, outlier buffer, concept drift