Research And Application Of Imbalanced Data Classification Based On Oversampling Algorithm

Posted on: 2020-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
Full Text: PDF
GTID: 2518306308994289
Subject: Management Science and Engineering
Abstract/Summary:
In an imbalanced data set, the number of samples differs widely between classes. Such data sets arise in many settings, including network intrusion detection, cancer detection, and spam classification. A traditional classifier trained on such data learns little about the minority class, so minority samples are often misclassified, which leads to poor classification results and low accuracy. Yet the minority class often carries important information, so improving classification performance on imbalanced data has both theoretical and practical value. Working at the data level, this thesis proposes an oversampling algorithm based on the strength of network nodes (Node Strength-SMOTE) and combines it with an ensemble learning classification algorithm for network intrusion detection. The main work of the thesis is as follows:

(1) An oversampling algorithm based on network node strength. To address the intra-class imbalance problem in imbalanced data, the Node Strength-SMOTE oversampling algorithm is proposed. It has three parts: denoising, using the strength of complex network nodes to determine sample generation weights, and synthesizing new samples. First, KNN is used to filter out noise samples in the minority class and to determine the number of new samples to generate. Next, the minority class samples are treated as network nodes, the edge weights between nodes are determined from the K nearest neighbors, and the strength of each node is calculated. The ratio of a node's strength to the total strength serves as its sample synthesis weight, which determines how many new samples are generated around that sample. Finally, a roulette-wheel selection mechanism determines the region in which each new sample is synthesized, and the new sample is generated by SMOTE interpolation.

(2) Experimental simulation. Node Strength-SMOTE is compared with the SMOTE, ADASYN, and KmeansSMOTE oversampling algorithms on 6 imbalanced UCI data sets. The results show that the data sets oversampled by Node Strength-SMOTE yield better classification results than those produced by the other oversampling algorithms.

(3) A network intrusion detection model based on node strength oversampling and ensemble learning. Building on the Node Strength-SMOTE oversampling algorithm, the AdaBoost.M2 ensemble learning classification algorithm is applied to network intrusion detection, giving the Node Strength-SMOTEBoost network intrusion detection model.

(4) Experimental simulation. Finally, Node Strength-SMOTEBoost is compared with the SMOTEBoost and RUSBoost algorithms on the KDD99 data set. The results show that it achieves the best classification results both among the attack classes and between attack data and normal data, each of which is imbalanced. This verifies the effectiveness of the Node Strength-SMOTEBoost model for network intrusion detection.
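
The three-stage oversampling procedure described in item (1) can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the noise rule, the edge-weight formula, or the exact roulette scheme. The sketch assumes a binary problem, treats a minority sample as noise when none of its k nearest neighbours is a minority sample, uses 1 / (1 + distance) as the edge weight, and applies roulette-wheel selection over a seed's edge weights to pick its interpolation partner. The function name node_strength_smote and all of its parameters are hypothetical.

```python
import numpy as np


def node_strength_smote(X, y, minority=1, k=3, seed=0):
    """Illustrative node-strength-weighted oversampler (binary case assumed)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Step 1 (assumed noise rule): drop minority points whose k nearest
    # neighbours in the full data set contain no minority sample.
    d_all = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d_all, np.inf)
    nn_all = np.argsort(d_all, axis=1)[:, :k]
    is_min = y == minority
    noisy = is_min & ~(y[nn_all] == minority).any(axis=1)
    keep = ~noisy
    P = X[is_min & keep]                      # cleaned minority samples
    n_new = int((~is_min).sum() - len(P))     # samples needed to balance classes
    if n_new <= 0 or len(P) < 2:
        return X[keep], y[keep]

    # Step 2: minority samples are nodes; each node links to its k nearest
    # minority neighbours with (assumed) edge weight 1 / (1 + distance).
    k_min = min(k, len(P) - 1)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k_min]
    w = 1.0 / (1.0 + np.take_along_axis(d, nn, axis=1))
    strength = w.sum(axis=1)                  # node strength = sum of edge weights
    weights = strength / strength.sum()       # share of the total strength

    # Step 3: split n_new across seed samples by weight, then choose each
    # partner by roulette-wheel selection over the seed's edge weights.
    counts = rng.multinomial(n_new, weights)
    synth = []
    for i, c in enumerate(counts):
        if c == 0:
            continue
        partners = rng.choice(nn[i], size=c, p=w[i] / w[i].sum())
        gaps = rng.random((c, 1))
        synth.append(P[i] + gaps * (P[partners] - P[i]))  # SMOTE interpolation
    synth = np.vstack(synth)

    X_out = np.vstack([X[keep], synth])
    y_out = np.concatenate([y[keep], np.full(len(synth), minority, dtype=y.dtype)])
    return X_out, y_out
```

Under these assumptions, a call such as node_strength_smote(X, y, minority=1, k=3) returns the original data with the noisy minority points removed, followed by the synthetic minority samples. With the inverse-distance edge weight, nodes in denser minority regions receive more synthetic samples; a different weight definition would shift that allocation.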
Keywords/Search Tags: Imbalanced data, Oversampling, Ensemble learning, SMOTE, Network intrusion detection