
Research On Imbalanced Data Classification Based On Sampling Method And Ensemble Learning

Posted on: 2022-04-13
Degree: Master
Type: Thesis
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 2518306557464394
Subject: Applied Statistics
Abstract/Summary:
In data classification, we usually assume that the numbers of samples in different classes are balanced, but truly balanced data are rare in practice, and class imbalance is a latent factor that degrades classification. Because the data contain a minority class and a majority class, classifiers that optimize overall accuracy tend to favor the majority class, so the important minority samples are often misclassified. This thesis studies the classification of imbalanced data at both the data level and the algorithm level. The main work is as follows:

(1) The typical methods for handling imbalanced data are undersampling and oversampling. This thesis analyzes the shortcomings of SMOTE and proposes a combined sampling method named CL-SMOTE. Because the k-means clustering used in the undersampling stage is sensitive to outliers, the improved method performs undersampling with PAM clustering instead, and in the oversampling stage it synthesizes minority samples by drawing on the Central Limit Theorem. Experiments on 7 datasets show that, on average, CL-SMOTE outperforms SMOTE by 6.84% in F1 value, 6.77% in AUC value, and 13.5% in G-mean value.

(2) Since CL-SMOTE tends to lose too much information and generate too many noisy samples on datasets with a high imbalance ratio, this thesis combines it with a temporary-marking method called TempC and processes the data in two stages: the first stage transforms highly imbalanced data into mildly imbalanced data, and the second stage applies CL-SMOTE to the mildly imbalanced data, yielding the TempC-CL-SMOTE method. Experiments show that this method improves the precision of CL-SMOTE by 3.7%-7.9% and is about 4% higher than the TempC method alone.

(3) On the algorithm side, this thesis first analyzes the AdaBoost algorithm and finds that AdaBoost defines the weight of each classifier mainly through a formula based on the error rate. In the face of imbalanced data, however, the error rate alone cannot measure classification quality, which prevents AdaBoost from handling imbalanced classification well. Inspired by the calculation formula of the F1 value, the thesis observes that if the misclassification cost of the more important positive (minority) class is raised reasonably, the algorithm can adapt well to imbalanced data. It therefore proposes the PFBoost algorithm and constructs a new indicator B_t. B_t is introduced into α_m, the classifier weight of the algorithm, and the attention-adjustment parameter inside B_t is varied at equal intervals so that the method adapts to data with various imbalance ratios. In each iteration, the classifier weights and sample weights pay more attention to misclassified minority samples. Finally, experiments show that PFBoost adapts to classification under various imbalance ratios much more flexibly than other ensemble algorithms.
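For context, the standard AdaBoost classifier weight that the discussion above refers to, α_m = ½·ln((1−ε_m)/ε_m), can be sketched as below. This is a minimal illustration of the baseline formula only; the indicator B_t and PFBoost itself are the thesis's contribution and are not reproduced here. The function name and the clipping guard are illustrative choices, not part of the thesis.

```python
import math

def adaboost_classifier_weight(error_rate):
    """Standard AdaBoost weight for a weak classifier with weighted error e:
    alpha = 0.5 * ln((1 - e) / e).  A smaller error gives a larger weight,
    and e = 0.5 (no better than chance) gives weight 0.  As the abstract
    notes, alpha depends only on the overall error rate, so it cannot tell
    minority-class mistakes apart from majority-class ones.
    """
    e = min(max(error_rate, 1e-12), 1 - 1e-12)  # guard against log(0)
    return 0.5 * math.log((1 - e) / e)
```

This makes the abstract's criticism concrete: two classifiers with the same overall error rate receive the same α even if one misclassifies only minority samples, which is why PFBoost folds a class-sensitive indicator into the weight.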
Keywords/Search Tags: imbalanced data, sampling method, TempC-CL-SMOTE, ensemble learning, PFBoost, adaptive