Traditional classification models usually assume that the classes in the data are balanced. In practice, however, imbalanced data are increasingly common in fields such as health care, insurance, and finance: the number of samples in some classes is far smaller than in others. This biases a classification model toward assigning minority-class samples to the majority class in pursuit of high accuracy, which makes the minority class difficult to predict. As an extreme example, consider a binary data set with an imbalance rate of 99%, in which the majority class accounts for 99% of the samples and the minority class for 1%. To maximize accuracy, a classifier can assign every sample to the majority class and still produce an error rate of only 1%. In medical diagnosis, where infectious genes are usually far rarer than non-infectious ones, a prediction model that tends to label infection-causing genes as non-infectious endangers people's lives. This paper discusses how to handle imbalanced data sets in the supervised setting. The imbalanced data sets used are binary. First, each imbalanced data set is divided into a training set and a test set. A GAN is then applied to the existing minority-class samples in the training set, and the model generates synthetic samples that are indistinguishable from real ones until the two classes contain the same number of samples. The two classes are combined into a new balanced data set, on which an XGBoost classifier is trained. Finally, the model is evaluated on the original imbalanced test set and the AUC value is recorded. Compared with the classical SMOTE method and the clustering-based undersampling method, our method achieves a better AUC, which increases its value in practical applications.
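The extreme case described above can be checked numerically. The following is a minimal sketch in plain Python, not the code used in this paper: the 1,000-sample data set, the helper names `accuracy`, `recall`, and `auc`, and the constant "always majority" classifier are all illustrative assumptions. It shows that such a classifier reaches 99% accuracy while detecting no minority samples, and that its AUC is no better than chance.

```python
# Illustrative sketch (assumed data and names, not the paper's implementation):
# accuracy versus AUC for a classifier that always predicts the majority class.

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class samples that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos

def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (pairwise definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical data set: 990 majority samples (label 0), 10 minority samples (label 1).
y_true = [0] * 990 + [1] * 10

# A classifier that assigns every sample to the majority class.
y_pred = [0] * len(y_true)
scores = [0.0] * len(y_true)  # identical score for every sample

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
auc_value = auc(y_true, scores)

print(f"accuracy={acc:.2f} recall={rec:.2f} AUC={auc_value:.2f}")
# prints: accuracy=0.99 recall=0.00 AUC=0.50
```

Accuracy rewards the degenerate classifier, whereas AUC exposes it, which is consistent with the choice to evaluate on the original imbalanced test set using AUC.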