Research On Classification Method Of Imbalanced Data Set Based On Improved Sampling Strategy

Posted on: 2022-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: L R Li
Full Text: PDF
GTID: 2518306521981439
Subject: Statistics
Abstract/Summary:
Unbalanced data sets are common in everyday applications, such as lung cancer diagnosis data, credit evaluation data, and network attack identification data. Classifying such data sets is called unbalanced classification. In such tasks, classifiers tend to pay more attention to the characteristics of the majority samples while ignoring the information carried by the minority samples. As a result, the minority samples are difficult to identify, even though they often have the greater value. Existing solutions to unbalanced data classification can be divided into three levels: the data level, the algorithm level, and a combination of the two. At the data level, the number of samples in each category is changed to balance the data set. At the algorithm level, the weight of the minority samples is increased so that the classifier attaches more importance to them. The SMOTE algorithm is a data-level over-sampling method that generates new minority samples by linear interpolation between neighboring minority samples in order to balance the original data set. This method effectively addresses the over-fitting problem caused by random over-sampling, but it still has shortcomings: it cannot select minority samples in a differentiated way, and it ignores the information of the majority samples among the neighbors when generating new samples. Therefore, this paper proposes a new over-sampling method, Re W-SMOTE, on this basis. Compared with the SMOTE algorithm, the proposed method selects minority samples in a differentiated way and uses the information of the majority samples among the neighbors when generating samples, which improves the quality and diversity of the generated minority samples. Experiments are carried out on multiple real unbalanced data sets from UCI and KEEL, using AUC, F1, Recall, TNR, Precision, and G-Mean as evaluation criteria, and the method is compared with other resampling methods. The experimental results show that the Re W-SMOTE method can effectively solve the problem of classifying minority samples in unbalanced data sets, and that it classifies minority samples more accurately and stably than the SMOTE and Borderline-SMOTE methods.
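The linear interpolation step that the abstract attributes to SMOTE can be illustrated with a short sketch. This is only a minimal illustration of the baseline SMOTE idea, not the Re W-SMOTE method, whose selection and weighting rules are not given in the abstract. The function names (`k_nearest`, `smote_sketch`), the parameters `n_new`, `k`, and `seed`, and the toy data are all assumptions made for this example.

```python
import math
import random


def k_nearest(idx, points, k):
    """Return the indices of the k points closest to points[idx] (Euclidean)."""
    dists = [
        (math.dist(points[idx], p), j)
        for j, p in enumerate(points)
        if j != idx
    ]
    dists.sort()
    return [j for _, j in dists[:k]]


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation.

    Hypothetical helper: base samples are picked uniformly at random,
    which is the undifferentiated selection the abstract criticizes.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))            # base minority sample
        j = rng.choice(k_nearest(i, minority, k))   # one of its minority neighbors
        gap = rng.random()                          # interpolation weight in [0, 1)
        base, neigh = minority[i], minority[j]
        # New point lies on the segment between the base sample and its neighbor.
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neigh)))
    return synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the generated data never leaves the region spanned by the minority class. The sketch also only looks at minority neighbors, so it cannot use the majority-class information that the abstract says the proposed method exploits.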
Keywords/Search Tags: Unbalanced data, Classification, Over-sampling, SMOTE