Research On The Classification Problem Of Imbalanced Dat

Posted on: 2023-12-18   Degree: Master   Type: Thesis
Country: China   Candidate: Y F Geng   Full Text: PDF
GTID: 2567307094489494   Subject: Applied statistics
Abstract/Summary:
Classical machine learning algorithms usually assume that each class in a data set has roughly the same number of samples, so they can simply optimize overall classification performance. In real problems, however, most data are imbalanced, and the small number of positive samples is often the focus of attention. It is therefore necessary to study the classification of imbalanced data. Researchers have proposed many methods for this problem, which fall roughly into two levels: the data level and the algorithm level. Data-level methods balance the two classes by increasing or reducing the number of samples. Algorithm-level methods stop treating the two kinds of misclassification as equally costly and pay more attention to the more costly one, so that the model places more weight on positive samples. To address the poor classification performance on imbalanced data, this thesis proposes the Bs-Ru sampling method, which combines Borderline SMOTE oversampling, which strengthens the training of boundary samples, with random undersampling, whose sampling strategy can be set. Bs-Ru is first compared with 10 commonly used sampling methods on several public data sets. An empirical analysis is then carried out on the Credit Card customers data set, whose positive-to-negative sample ratio is approximately 1:7, and the improvement that sampling brings to each ensemble classification model and each individual classification model is compared. The study finds that Bs-Ru outperforms the other common sampling methods it was compared with in terms of F-score, recall and AUC, which verifies its feasibility and superiority. In addition, a comparison of the classification algorithms shows that the layered Stacking model performs best and, combined with Bs-Ru sampling, achieves the best classification results: accuracy, F_2 value, recall and AUC reach 0.966, 0.907, 0.926 and 0.948, respectively. The classification of imbalanced data can thus be effectively improved by combining data sampling with ensemble algorithms.
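The abstract gives no implementation details for Bs-Ru, so the following is only a minimal sketch of the general idea it describes: oversample the minority class near the class boundary, then randomly undersample the majority class to a chosen ratio. It assumes a simplified Borderline-SMOTE variant written with the Python standard library. The function name `bs_ru_sample`, the parameters `over_ratio`, `under_ratio`, `k` and `m`, and the toy data are hypothetical and are not taken from the thesis.

```python
import math
import random


def nearest(idx, points, candidates, k):
    """Indices of the k candidates closest to points[idx], excluding idx itself."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def bs_ru_sample(X, y, over_ratio=0.5, under_ratio=1.0, k=5, m=10, seed=0):
    """Borderline-style oversampling of class 1, then random undersampling of class 0.

    over_ratio:  minority/majority ratio targeted by oversampling (assumed value).
    under_ratio: minority/majority ratio targeted by undersampling (assumed value).
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    everyone = list(range(len(X)))

    # A minority point counts as "borderline" when at least half, but not all,
    # of its m nearest neighbours belong to the majority class.
    borderline = []
    for i in minority:
        neighbours = nearest(i, X, everyone, m)
        n_major = sum(1 for j in neighbours if y[j] == 0)
        if len(neighbours) / 2 <= n_major < len(neighbours):
            borderline.append(i)

    # Create synthetic points by interpolating between a borderline point
    # and one of its k nearest minority neighbours.
    synthetic = []
    n_needed = max(0, int(over_ratio * len(majority)) - len(minority))
    while borderline and len(minority) > 1 and len(synthetic) < n_needed:
        i = rng.choice(borderline)
        j = rng.choice(nearest(i, X, minority, k))
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])

    # Randomly keep only as many majority points as the requested ratio allows.
    n_minority = len(minority) + len(synthetic)
    n_keep = min(len(majority), int(n_minority / under_ratio))
    kept_majority = rng.sample(majority, n_keep)

    X_out = [X[i] for i in kept_majority + minority] + synthetic
    y_out = [0] * n_keep + [1] * n_minority
    return X_out, y_out


# Toy 1-D data with a 1:7 positive-to-negative ratio (2 positives, 14 negatives).
X = [[float(v)] for v in range(14)] + [[13.4], [13.6]]
y = [0] * 14 + [1] * 2
X_bal, y_bal = bs_ru_sample(X, y, over_ratio=0.5, under_ratio=1.0, k=2, m=4)
print(y_bal.count(0), y_bal.count(1))
```

In this sketch, oversampling only to an intermediate ratio before undersampling keeps the number of synthetic points small while still discarding fewer majority samples than undersampling alone would. In practice a maintained library implementation of Borderline SMOTE and random undersampling would be preferable to this hand-written version.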
Keywords/Search Tags: Imbalanced data, Binary classification, Data sampling, Ensemble learning