
Research Of Unsupervised Anomaly Detection Methods Based On Ensemble Machine Learning Model

Posted on: 2021-10-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhang
GTID: 1488306458977019
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the big data era, a shortage of data is no longer the main concern; instead, attention is increasingly turning to data quality. A number of researchers have begun developing methods and theoretical analyses for mining the most valuable information from large volumes of data. Anomaly detection is one of the most active topics in this area: it detects and recognizes anomalies, that is, samples that differ significantly from the majority. It has been widely applied in many domains, such as intrusion detection in network security, fault detection in machinery, cancer cell identification in medical images, and credit card fraud detection in the financial industry. Most anomaly detection studies focus on designing a method specific to one domain. As a result, existing methods generalize poorly and cannot reliably detect the varied anomalies that arise across domains. In practice, most kinds of anomalies are hard to obtain in advance, and novel anomalies appear during detection, so generalization performance is particularly important for anomaly detection. Designing an anomaly detection method that is effective across domains is therefore an important task.

Ensemble learning achieves better performance than a single method by combining the strengths of multiple algorithms. It has performed well on traditional machine learning problems such as classification and clustering, where it has been shown to improve the generalization ability of the underlying methods. However, applying ensemble learning to anomaly detection (called anomaly ensemble) is extremely challenging. The two main obstacles are the absence of data labels and the extreme imbalance between the normal and anomalous classes. Existing studies use ensemble learning to improve the generalization performance of one or more anomaly detection methods, but they mostly treat anomaly ensemble as a simple combination problem and usually ignore the model training phase, which limits generalization ability. To further improve the generalization ability of anomaly ensemble methods, this dissertation focuses on the training phase of the base detection models and conducts a systematic study from four aspects: ensemble data preparation, ensemble model training, ensemble model combination, and the ensemble learning framework. The main contributions are summarized as follows:

1) To prepare the training of each ensemble component, the most representative normal samples must be selected from the original dataset. This dissertation proposes an ensemble-based joint training framework for anomaly detection that iteratively optimizes sample preprocessing and anomaly scoring together. The method builds an optimization model that contains both the computation of sample weights and the evaluation of sample abnormality. Specifically, the anomaly scores produced by the latter are used to compute the sample weights of the former (i.e., samples with a high probability of being anomalous receive small weights), and the reweighted dataset produced by the former prevents the performance degradation of the latter that is mainly caused by interference from anomalous samples during training. First, to compute the sample weights, a regularization term based on prior knowledge is added to the objective function. Second, so that anomalous samples receive higher anomaly scores than normal samples, a hinge loss based on anomaly scores is also added to the objective function. Finally, an alternating iterative method is designed to optimize the ensemble model. Experimental results on various anomaly detection datasets show that the proposed method improves substantially on popular algorithms.

2) In the model training phase, fully accounting for the diversity requirement is an effective way to build a good anomaly ensemble. This dissertation proposes a diversity-aware sequential anomaly ensemble method that improves anomaly detection by strengthening model diversity. Ensemble diversity is divided into two parts: sample diversity and model diversity. For sample diversity, subsampling is used in the initial phase. For model diversity, an ensemble-based optimization model is designed to further increase the diversity of each ensemble component. In addition, an unsupervised diversity measure is proposed for the quantitative evaluation of diversity, and an anomaly pruning strategy is designed to remove pseudo-anomalous samples in the training phase. By combining sample diversity and model diversity, the proposed method exhibits better generalization performance, and it achieves better detection results than a range of algorithms on multiple datasets.

3) In the model combination phase, a reasonable combination of the ensemble components is important for the final ensemble performance. This dissertation proposes a bi-level ensemble learning based unsupervised anomaly detection method to further improve generalization performance and to reduce the information loss caused by subspace sampling. The two-level combination strategy consists of internal integration and external integration: the first level, internal integration, reduces information loss, and the second level, external integration, improves generalization ability. In addition, a diversity loss function is designed for retraining the first-level models, and a novel weighted combination strategy is proposed to ensure effective integration at the second level. With this two-level learning strategy, the proposed method shows varying degrees of performance improvement on both high- and low-dimensional datasets and on both small and large datasets.

4) At the framework level, data preprocessing, model training, and model combination need to be optimized jointly to further improve the generalization performance of anomaly ensembles. This dissertation designs an eager-model-based unsupervised sequential ensemble framework that treats these three components in a unified way, and proposes an adaptive ensemble method based on a non-metric local anomaly score to instantiate the framework. First, a sampling method based on the chi-square distribution is used to initialize the reference model. Second, a non-metric anomaly evaluation method based on a weighted Mahalanobis distance is proposed, in which a weighted sum of local distances over multiple feature subsets replaces the global distance and forms the final anomaly score. Finally, an adaptive combination strategy based on anomaly ranking is designed to combine the results of the ensemble components. In multiple comparative experiments, the proposed method works well on common static datasets and also shows considerable potential on dynamic datasets.

In summary, this dissertation focuses on four important components of anomaly ensemble, namely data preparation, ensemble model training, ensemble model combination, and the ensemble learning framework. It analyzes the challenges and shortcomings of each component, designs several anomaly ensemble methods, proposes a general sequential ensemble framework, and achieves competitive anomaly detection performance. It is intended to serve as a reference for further in-depth research.
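
The idea in the fourth contribution, replacing one global distance with a weighted sum of local Mahalanobis distances over several feature subsets, can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the function name `local_mahalanobis_scores`, the hand-picked feature subsets, the uniform default weights, and the small ridge term added to each covariance matrix are all assumptions made for illustration, and the reference set is simply passed in rather than drawn by the chi-square based sampling the abstract mentions.

```python
import numpy as np

def local_mahalanobis_scores(X, reference, subsets, weights=None, ridge=1e-6):
    """Score samples by a weighted sum of per-subset Mahalanobis distances.

    X         : (n, d) samples to score
    reference : (m, d) samples assumed to be mostly normal
    subsets   : list of feature-index lists (hypothetical choice)
    weights   : one weight per subset; uniform if omitted (assumption)
    """
    X = np.asarray(X, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if weights is None:
        weights = np.full(len(subsets), 1.0 / len(subsets))

    scores = np.zeros(X.shape[0])
    for w, cols in zip(weights, subsets):
        ref = reference[:, cols]
        mean = ref.mean(axis=0)
        # Covariance of the reference set on this subset; the ridge term
        # keeps the matrix invertible (an illustrative safeguard).
        cov = np.atleast_2d(np.cov(ref, rowvar=False)) + ridge * np.eye(len(cols))
        diff = X[:, cols] - mean
        # Squared Mahalanobis distance of every row to the reference mean.
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        scores += w * np.sqrt(np.maximum(d2, 0.0))
    return scores

# Usage: 200 Gaussian points plus one planted outlier in the last row.
rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 4))
outlier = np.array([[8.0, 8.0, 8.0, 8.0]])
X = np.vstack([normal, outlier])
subsets = [[0, 1], [2, 3], [0, 3]]
scores = local_mahalanobis_scores(X, normal, subsets)
print(int(np.argmax(scores)))  # index of the highest-scoring sample
```

Because each term only looks at a few features, the combined score is a sum of local distances rather than a single distance in the full feature space. How the subsets and their weights are actually chosen in the dissertation is not stated in the abstract.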
Keywords/Search Tags:Anomaly detection, Ensemble learning, Unsupervised learning, Joint training, Diversity, Generalization ability, Bi-level ensemble, Sequential ensemble