
Two-class Imbalanced Data Classification Based On Diverse Data Generation And Ensemble Learning

Posted on: 2021-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: J X Qi
Full Text: PDF
GTID: 2428330620470568
Subject: Software engineering
Abstract/Summary:
Classification has long been a central research topic in machine learning, and many classification methods have been proposed. Most of them, however, are designed for class-balanced problems. When the classes are imbalanced, most existing classification algorithms struggle to classify minority class samples correctly, even though the minority class samples are often very important. Class imbalance also arises in many practical problems, such as credit card fraud detection, disease diagnosis, and spam filtering. Studying imbalanced data classification therefore has both theoretical and practical value. According to the number of classes in the data set, these problems are usually divided into two categories: two-class imbalanced data classification and multi-class imbalanced data classification. This thesis studies the two-class problem and proposes two diverse oversampling methods based on generative adversarial networks (GANs) and a fuzzy-integral-based ensemble method for imbalanced data classification. The main work consists of the following three points:

1. A diverse oversampling method based on an improved GAN model is proposed, denoted GANDO (Generative Adversarial Networks for Diversity Oversampling). In the improved model, the discriminator is replaced with a classifier with three outputs, which predict whether an input sample comes from the majority class, comes from the minority class, or is a generated sample. The advantage of this replacement is that the classifier learns both the distribution of the minority class samples and a good classification boundary, so the generated samples avoid overlapping with majority class samples. In addition, an intra-class divergence regularization term is added to the generator loss, which effectively avoids mode collapse and gives the generated samples good diversity.

2. An oversampling method based on a dual-discriminator GAN is proposed, denoted D2GAO (Dual Discriminator Generative Adversarial nets for Oversampling). D2GAO uses two discriminators to ensure that the generated samples have good diversity. In addition, a classifier is introduced into the model to learn the difference between positive and negative samples. This both ensures the correctness of the generated positive samples and avoids overlap between the generated positive samples and the negative samples.

3. Building on these two diverse oversampling methods, a fuzzy-integral-based ensemble method for imbalanced data classification is proposed. The basic idea is as follows. For data sets with a high imbalance ratio, if too many positive class samples are generated by oversampling, the oversampled positive class becomes very dense and overlaps severely. To deal with this problem, after the positive class samples are oversampled several times, the negative class samples are divided into several subsets according to the size of the oversampled positive class sample set. Each negative class subset, together with the positive class sample set, forms a balanced data set that is used to train a classifier. Finally, the trained classifiers are combined by a fuzzy integral and used for imbalanced data classification.
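The partition-and-fuse idea in the third point can be illustrated with a minimal Python sketch. The abstract does not say which fuzzy integral or fuzzy measure the thesis uses, so the sketch assumes a Choquet integral over a Sugeno lambda-measure, a common choice for classifier fusion. The function names (`split_negatives`, `sugeno_lambda`, `choquet_integral`), the fuzzy densities, and the per-classifier scores are all hypothetical, and training the classifiers themselves is left out.

```python
import numpy as np


def split_negatives(X_neg, n_pos, rng):
    """Shuffle negative samples and cut them into subsets of about n_pos rows each."""
    n_subsets = max(1, len(X_neg) // n_pos)
    shuffled = rng.permutation(len(X_neg))
    return [X_neg[part] for part in np.array_split(shuffled, n_subsets)]


def sugeno_lambda(densities):
    """Find the nonzero root lam > -1 of prod(1 + lam * g_i) = 1 + lam by bisection."""
    g = np.asarray(densities, dtype=float)
    if len(g) < 2:
        raise ValueError("need at least two classifiers")
    total = g.sum()
    if abs(total - 1.0) < 1e-6:
        return 0.0  # densities sum to one, so the measure is already additive

    def f(lam):
        return np.prod(1.0 + lam * g) - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0 + 1e-9, -1e-9  # the root lies in (-1, 0)
    else:
        lo, hi = 1e-9, 1.0           # the root is positive; grow hi until f changes sign
        while f(hi) < 0.0:
            hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.sign(f(mid)) == np.sign(f(lo)):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def choquet_integral(scores, densities, lam):
    """Fuse per-classifier scores in [0, 1] with a Choquet integral (assumed fusion rule)."""
    h = np.asarray(scores, dtype=float)
    g = np.asarray(densities, dtype=float)
    order = np.argsort(-h)                # classifiers from highest to lowest score
    h_sorted = np.append(h[order], 0.0)   # sentinel so the last difference is h_(n) - 0
    fused, measure = 0.0, 0.0
    for i, k in enumerate(order):
        # Lambda-measure recursion: g(A_i) = g_k + g(A_{i-1}) + lam * g_k * g(A_{i-1})
        measure = g[k] + measure + lam * g[k] * measure
        fused += (h_sorted[i] - h_sorted[i + 1]) * measure
    return fused


rng = np.random.default_rng(0)
X_neg = rng.normal(size=(75, 2))                     # toy negative class samples
subsets = split_negatives(X_neg, n_pos=25, rng=rng)  # 25 = size of the oversampled positive set
densities = [0.3, 0.3, 0.3]                          # hypothetical per-classifier fuzzy densities
lam = sugeno_lambda(densities)
fused = choquet_integral([0.9, 0.6, 0.2], densities, lam)  # hypothetical positive-class scores
```

Each subset in `subsets` would be paired with the oversampled positive set to train one classifier, and `fused` stands for the combined positive-class score of those classifiers on a single test sample. In practice the densities might come from each classifier's validation performance, which is also an assumption here.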
Keywords/Search Tags: Class imbalance, Generative Adversarial Networks, Oversampling, Diversity, Fuzzy integral, Ensemble learning