Research On The Classification Problem Of Imbalanced Dat

Posted on:2023-12-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y F Geng

GTID:2567307094489494

Subject:Applied statistics

Abstract/Summary:

Classical machine learning algorithms usually assume that each class in a data set contains roughly the same number of samples, so that overall classification accuracy can be optimized directly. In real problems, however, most data are imbalanced, and the small number of positive samples is often the focus of attention. It is therefore necessary to study the classification of imbalanced data. Researchers have proposed many methods for this problem, which fall roughly into two levels: the data level and the algorithm level. Data-level methods balance the two classes by increasing or reducing the number of samples. Algorithm-level methods stop treating the two types of misclassification as equally costly and pay more attention to the one with the higher cost, so that the model places more weight on positive samples.

To address the poor classification performance on imbalanced data, this thesis proposes the Bs-Ru sampling method, which combines Borderline SMOTE oversampling, which strengthens training on boundary samples, with random undersampling, whose sampling strategy can be set. Bs-Ru is compared with 10 commonly used sampling methods on several public data sets. An empirical analysis is then carried out on the Credit Card customers data set, whose ratio of positive to negative samples is approximately 1:7. After sampling, the improvement in each ensemble classification model and each individual classification model is compared and analyzed. The study finds that Bs-Ru outperforms the other sampling methods in the comparison in terms of F-score, recall and AUC, which verifies its feasibility and superiority. A comparative analysis of the classification algorithms further shows that the Stacking model performs best and, together with Bs-Ru sampling, gives the best classification results: accuracy, F2 score, recall and AUC reach 0.966, 0.907, 0.926 and 0.948, respectively. The combined use of data sampling and ensemble algorithms can therefore effectively improve the classification of imbalanced data.
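
The abstract does not spell out the Bs-Ru procedure, so the following is only a minimal sketch of the general idea it describes: Borderline-SMOTE-style oversampling of the positive class followed by random undersampling of the negative class. Everything specific here is an assumption made for illustration and is not taken from the thesis: the function names `bs_ru_sketch` and `make_toy_data`, the order of the two steps, the parameters `over_ratio` and `under_ratio`, the use of a single `k` for both neighbour searches, and the toy 1:7 data set.

```python
import math
import random


def _nearest(idx, points, candidates, k):
    """Return the k candidate indices closest to points[idx], excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def bs_ru_sketch(X, y, over_ratio=0.5, under_ratio=1.0, k=5, seed=0):
    """Hypothetical hybrid sampler: oversample label 1, then undersample label 0.

    over_ratio:  target positives / original negatives after oversampling.
    under_ratio: target positives / negatives after undersampling (must be > 0).
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    everyone = list(range(len(X)))

    # Borderline step: a positive sample is "in danger" when at least half,
    # but not all, of its k nearest neighbours are negative.
    danger = []
    for i in minority:
        neighbours = _nearest(i, X, everyone, k)
        n_negative = sum(1 for j in neighbours if y[j] == 0)
        if k / 2 <= n_negative < k:
            danger.append(i)
    if not danger:
        danger = minority  # assumption: fall back to plain SMOTE-style seeds

    # Interpolate between a danger point and one of its positive neighbours.
    n_new = max(0, int(over_ratio * len(majority)) - len(minority))
    if len(minority) < 2:
        n_new = 0  # no positive neighbour to interpolate towards
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(danger)
        j = rng.choice(_nearest(i, X, minority, k))
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])

    # Random undersampling of the negatives down to the requested ratio.
    n_pos = len(minority) + len(synthetic)
    n_keep = min(len(majority), int(n_pos / under_ratio))
    kept = rng.sample(majority, n_keep)

    X_res = [list(X[i]) for i in minority] + synthetic + [list(X[i]) for i in kept]
    y_res = [1] * n_pos + [0] * n_keep
    return X_res, y_res


def make_toy_data(n_neg=140, n_pos=20, seed=0):
    """Toy 2-D data with a 1:7 positive-to-negative ratio (illustrative only)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_neg)]
    X += [[rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)] for _ in range(n_pos)]
    y = [0] * n_neg + [1] * n_pos
    return X, y


X, y = make_toy_data()
X_res, y_res = bs_ru_sketch(X, y)
print("positives:", sum(y_res), "negatives:", len(y_res) - sum(y_res))
```

With the toy data (20 positive, 140 negative) and the default ratios, oversampling raises the positives to 70 and undersampling then keeps 70 negatives. In practice, a library such as imbalanced-learn would likely be used instead of hand-written neighbour searches.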

Keywords/Search Tags:

Imbalanced data, Binary classification, Data sampling, Ensemble learning

Related items

1	Research On Classification Of Imbalanced Datasets Based On Random Forest
2	Research On Classification Of Unbalanced Data Sets Based On Hybrid Sampling And Ensemble Learning
3	Research On High Dimensional Imbalanced Data Classification In The Identification Of Risk User
4	Research On Unbalanced Data Classification Based On Ensemble Learning
5	An Empirical Study On Data Sampling Of Unbalanced Classification
6	Research On Ensemble Learning Algorithm Of Classification Based On Cost-sensitive
7	A Study On Classification Of Imbalanced Data And Evaluation Metrics
8	Kolmogorov-Smirnov Learning By Neuron Networks With A Nonconvex Surrogate Loss
9	Feature Selection Based On Rough Set For Binary-class Imbalanced Data
10	The Research Based On Logistic Algorithm And Data Sampling Of Unbalanced Classification Data