Research On Hybrid Sampling Of Imbalanced Data Based On Data Distribution

Posted on: 2022-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Chen
Full Text: PDF
GTID: 2518306317994049
Subject: Computer application technology
Abstract/Summary:
Imbalanced data classification is a current research hotspot and arises in many fields, such as disease detection, financial fraud detection, and intrusion detection. The main characteristic of imbalanced data is that minority samples are few and difficult to identify, yet these samples are often the more valuable ones. The goal of imbalanced data learning is therefore to improve the recognition rate of minority samples without affecting the overall recognition rate. Current methods for handling imbalanced data fall mainly into data-level methods and algorithm-level methods. At the data level, under-sampling and over-sampling are both effective, but under-sampling alone may discard valuable information, and over-sampling alone may lead to over-fitting and aggravate the overlap between classes. This thesis focuses on data-level methods. It takes hybrid sampling, which combines under-sampling with over-sampling, as the sampling framework and studies the binary imbalanced data classification problem. The main research work is as follows:

(1) A hybrid sampling method based on the local distribution of the data is proposed. Safe sample screening retains as many as possible of the samples that are useful for constructing the decision boundary, so it is used first to under-sample the data, which avoids the loss of valuable information to some extent. The Weighted-SMOTE method is then used for over-sampling, which largely balances the dataset. Results on the experimental datasets show that this method effectively improves the recognition rate of minority samples.

(2) To address the problem that over-sampling may cause over-fitting and aggravate the overlap between classes, a hybrid sampling method based on the overall distribution of the data is proposed. A generative adversarial network can learn the overall distribution of the data and generate samples according to the learned distribution, thereby avoiding over-fitting and mitigating the overlap between classes to a certain extent. First, the model is trained on the minority samples, so that the generator learns the data distribution of the minority class and the discriminator acquires a certain recognition ability. Then, the majority samples are under-sampled according to the difference between the minority samples and the majority samples. Finally, the generator is used to generate samples that approximately follow the data distribution of the original minority class. Compared with general over-sampling methods, this method mitigates over-fitting and the overlap between classes. Its effectiveness is verified on several datasets.

(3) The proposed hybrid sampling method is applied to intrusion detection. The purpose of intrusion detection is to identify intrusion events in a large amount of data. Intrusion detection data are imbalanced, so the task can be treated as an imbalanced data classification problem. The experimental results show that the hybrid sampling method is effective in this application.

Hybrid sampling of imbalanced data can effectively alleviate the impact of class imbalance on classification performance and improve the recognition rate of minority samples without affecting the overall recognition rate. This research has theoretical and practical significance for the imbalanced classification problem.
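The hybrid sampling framework described above (under-sample the majority class, then over-sample the minority class) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the thesis's method: random under-sampling stands in for safe sample screening, plain SMOTE-style interpolation stands in for Weighted-SMOTE, and the names `smote_like`, `hybrid_sample`, and `keep_ratio` are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE idea,
    used here as a stand-in for Weighted-SMOTE)."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def hybrid_sample(majority, minority, keep_ratio=0.5, k=3, seed=0):
    """Under-sample the majority class, then over-sample the minority
    class until both classes have the same size."""
    rng = random.Random(seed)
    # Assumption: random under-sampling replaces safe sample screening.
    n_keep = max(len(minority), int(len(majority) * keep_ratio))
    n_keep = min(n_keep, len(majority))
    kept_majority = rng.sample(majority, n_keep)
    # Generate only as many synthetic points as needed to reach balance.
    n_new = max(0, n_keep - len(minority))
    new_minority = minority + smote_like(minority, n_new, k=k, rng=rng)
    return kept_majority, new_minority


if __name__ == "__main__":
    gen = random.Random(42)
    majority = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(100)]
    minority = [(gen.gauss(3, 0.5), gen.gauss(3, 0.5)) for _ in range(10)]
    maj, mino = hybrid_sample(majority, minority)
    print(len(maj), len(mino))
```

Under-sampling first means fewer synthetic minority points are needed to reach balance, which is one plausible reading of why a hybrid scheme limits both the information loss of pure under-sampling and the over-fitting risk of pure over-sampling.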
Keywords/Search Tags:imbalanced data, hybrid sampling, safe sample screening, generative adversarial network, intrusion detection