
Comprehensive Oversampling And Undersampling Study Of Imbalanced Data Sets

Posted on: 2021-04-26    Degree: Master    Type: Thesis
Country: China    Candidate: Z Z Zi    Full Text: PDF
GTID: 2428330614463925    Subject: Circuits and Systems
Abstract/Summary:
Imbalanced learning has become one of the most active topics in data mining, with broad practical demand in applications such as medical diagnosis, credit card fraud detection, and spam filtering. When classifying an imbalanced data set, the classification accuracy of the minority class should be improved as much as possible without causing too great a loss in the accuracy of the majority class. Starting from the inherent characteristics of imbalanced data, this thesis takes into account the distribution characteristics of practical data sets and the importance of each feature in classification, and improves commonly used, well-performing sampling algorithms for imbalanced data. The resulting new under-sampling and over-sampling algorithms effectively compensate for the shortcomings of traditional sampling algorithms. To further improve the recognition rate of minority-class samples, ensemble learning is combined with the sampling algorithms, yielding a complete classification system for imbalanced data learning. The main results are as follows:

(1) Most existing oversampling algorithms for imbalanced data use only local information of the minority class, so the synthesized samples do not follow the original distribution and tend to propagate noise. To address this, an oversampling algorithm for imbalanced data sets based on sparse representation (KSOS) is proposed. The method uses global information of the minority class to synthesize samples, and then uses neighbor information to remove synthesized samples that fall inside the majority-class region. Experimental results show that samples synthesized by KSOS conform better to the original data distribution, avoid the propagation of noise, and improve recognition performance on the minority class.

(2) Most KNN-based undersampling algorithms for imbalanced data cannot control the sampling rate and do not consider the effect of outliers. To address this, an undersampling algorithm based on K nearest neighbors with outlier removal (KUS) is proposed. The method first removes majority-class samples from regions where such samples are dense to obtain balanced data, and then removes outliers by quantile-based outlier detection. Experimental results show that KUS reduces the information loss of the majority class to a certain extent and improves the recognition rate of the minority class.

(3) The RUSBoost algorithm, which combines random undersampling with ensemble learning, has unstable classification performance. To address this, a method combining ensemble learning with clustering-based undersampling is proposed. Although the method is similar to RUSBoost, it uses a different sampling strategy. Experiments show that the algorithm improves the recognition rate of the minority class.
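The abstract does not give the KSOS procedure in detail; the following is a minimal Python sketch of the general idea it describes, assuming each minority sample is expressed as a sparse linear combination of the other minority samples (global information), a synthetic sample is built from that combination, and synthetic samples whose neighborhood is dominated by the majority class are discarded. All function names and parameters here are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.neighbors import NearestNeighbors


def sparse_oversample(X_min, X_maj, n_new=100, alpha=0.1, k=5, seed=0):
    """Sketch of oversampling via sparse representation of the minority class."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        target = X_min[i]
        basis = np.delete(X_min, i, axis=0)        # all other minority samples
        # Sparse, non-negative reconstruction weights over the whole minority class.
        coef = Lasso(alpha=alpha, positive=True).fit(basis.T, target).coef_
        if coef.sum() == 0:
            continue
        weights = coef / coef.sum()
        synthetic.append(weights @ basis)          # weighted combination = new sample
    synthetic = np.array(synthetic)
    if len(synthetic) == 0:
        return synthetic

    # Keep a synthetic sample only if minority samples dominate its k nearest
    # neighbours in the combined data set (removes samples in majority regions).
    X_all = np.vstack([X_min, X_maj])
    y_all = np.r_[np.ones(len(X_min)), np.zeros(len(X_maj))]
    nn = NearestNeighbors(n_neighbors=k).fit(X_all)
    _, idx = nn.kneighbors(synthetic)
    keep = y_all[idx].mean(axis=1) >= 0.5
    return synthetic[keep]
```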
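For contribution (2), a minimal sketch of a KNN-based undersampling step with outlier removal is given below, assuming majority samples are ranked by how many of their k nearest neighbours are also majority samples (a density proxy), the densest ones are removed until the classes balance, and the remaining points are trimmed with a quartile (IQR) rule on their k-th neighbour distance. Thresholds and names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_undersample(X_maj, X_min, k=5):
    """Sketch of undersampling dense majority regions, then pruning outliers."""
    X_all = np.vstack([X_maj, X_min])
    is_maj = np.r_[np.ones(len(X_maj)), np.zeros(len(X_min))]

    # Density score: fraction of majority points among each majority sample's
    # k nearest neighbours (the first neighbour is the point itself, so skip it).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_all)
    _, idx = nn.kneighbors(X_maj)
    density = is_maj[idx[:, 1:]].mean(axis=1)

    # Remove the densest majority samples until the two classes are balanced.
    keep = np.argsort(density)[: len(X_min)]
    X_kept = X_maj[keep]

    # Quantile (IQR) outlier removal on the distance to the k-th neighbour.
    dist, _ = NearestNeighbors(n_neighbors=k).fit(X_kept).kneighbors(X_kept)
    d_k = dist[:, -1]
    q1, q3 = np.percentile(d_k, [25, 75])
    return X_kept[d_k <= q3 + 1.5 * (q3 - q1)]
```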
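For contribution (3), the sketch below illustrates one way to combine clustering-based undersampling with a boosting ensemble in the spirit of RUSBoost: the majority class is reduced to the samples nearest each k-means centroid, and weak learners are then boosted on the balanced set. This is an illustrative composition under stated assumptions (minority class labelled 1, one undersampling pass instead of re-sampling inside every boosting round), not the thesis algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier


def cluster_undersample(X, y, minority=1, seed=0):
    """Keep one representative majority sample per k-means cluster."""
    X_min, X_maj = X[y == minority], X[y != minority]
    km = KMeans(n_clusters=len(X_min), n_init=10, random_state=seed).fit(X_maj)
    # Represent the majority class by the sample closest to each centroid.
    reps = np.array([X_maj[np.argmin(np.linalg.norm(X_maj - c, axis=1))]
                     for c in km.cluster_centers_])
    X_bal = np.vstack([X_min, reps])
    y_bal = np.r_[np.full(len(X_min), minority), np.full(len(reps), 1 - minority)]
    return X_bal, y_bal


# Usage sketch: undersample by clustering, then boost shallow trees on the
# balanced data (X, y assumed to be the imbalanced training set).
# X_bal, y_bal = cluster_undersample(X, y)
# clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
# clf.fit(X_bal, y_bal)
```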
Keywords/Search Tags:Imbalanced learning, Classification, Under-sampling, Over-sampling, Ensemble learning