| Anomaly detection is an important task in knowledge discovery and machine learning. With the rapid development of the Internet, the security, authenticity, and validity of data have become a widespread concern. In large data sets, however, abnormal data is inevitable, so detecting outliers is vital for maintaining network security and for helping enterprises and individuals avoid risk. The main purpose of anomaly detection is to find, in the given data, samples that clearly deviate from the normal pattern or behave differently from most other samples, so that they can be identified and the associated risk avoided. In network traffic data, anomaly detection can act as a "sentinel" of network security, giving early warning of security threats and illegal intrusions. It can also be used for abnormal user identification, traffic cheating detection, abnormal order detection, rare disease identification, fraud detection, loan risk identification, scalper identification, and similar tasks. Against this background, this paper studies anomaly detection based on the idea of ensemble learning, demonstrates the effectiveness of the proposed method through experiments, and shows its applicability in two specific application scenarios. Because anomaly data is highly imbalanced, most single models overfit easily in anomaly detection. By studying ensemble methods such as Isolation Forest, Random Forest, AdaBoost, XGBoost, and LGBM, this paper finds that ensemble models handle this problem well and achieve higher detection accuracy than single models. In practice, however, it is difficult to choose a suitable ensemble model based on the structure of the anomalous data. Therefore, building on the idea of ensemble learning, this paper proposes two model-fusion anomaly detection methods, one based on Stacking and one based on Voting, which on the one hand improve model accuracy and reduce the risk of overfitting, and on the other hand avoid the poor learning performance caused by improper model selection. The KDD Cup 1999 data set was used to train both the proposed fusion models and commonly used anomaly detection models, and the results were evaluated with macro precision, macro recall, macro F1-score, the Receiver Operating Characteristic (ROC) curve, and the Area Under the ROC Curve (AUC). The comparison shows that the fusion models outperform the single models and achieve the best results among all methods tested. Furthermore, the Stacking-based fusion algorithm achieves a higher AUC than the Voting-based one, but its training is relatively time-consuming. In addition, Isolation Forest performs relatively well and can be chosen for anomaly detection on unlabeled data in practice. Finally, using two practical scenarios in network traffic data, anomaly detection of advertising traffic and anomaly detection of order traffic, this paper further illustrates the applicability of the fusion model by modeling and analyzing real data. |
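
To make the Voting fusion and the macro-averaged evaluation indexes mentioned in the abstract concrete, the following is a minimal sketch rather than the implementation used in this paper. It assumes hard (majority) voting over binary labels, with 1 marking an anomaly; the function names `hard_vote` and `macro_scores` and the toy predictions `model_a`, `model_b`, and `model_c` are hypothetical and are not results from the KDD Cup 1999 experiments.

```python
from collections import Counter


def hard_vote(predictions):
    """Fuse per-model label lists by majority vote (assumed hard voting).

    With an odd number of models and binary labels there are no ties.
    """
    fused = []
    for labels in zip(*predictions):
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused


def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1).

    Each class is scored separately and the per-class scores are averaged
    with equal weight, so the rare anomaly class counts as much as the
    majority class.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy, made-up data: 6 normal samples (0) and 2 anomalies (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
model_a = [0, 0, 0, 0, 0, 1, 1, 0]  # hypothetical base model outputs
model_b = [0, 0, 0, 0, 0, 0, 1, 1]
model_c = [0, 1, 0, 0, 0, 0, 0, 1]

fused = hard_vote([model_a, model_b, model_c])
print("single model:", macro_scores(y_true, model_a))
print("voting fusion:", macro_scores(y_true, fused))
```

In this toy case the individual models err on different samples, so the majority vote corrects each isolated mistake. Macro averaging is used because, on imbalanced data, plain accuracy is dominated by the normal class.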