Research On Classification Of Imbalanced Datasets Based On Random Forest

Posted on: 2022-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: S T Yin
GTID: 2517306509989029
Subject: Applied Statistics
Abstract/Summary:
Machine learning algorithms usually assume that the classes in a data set are evenly distributed. In practical applications, however, data sets are affected by many external factors, which leads to limited sample sizes and imbalanced classes. Because such data violate this assumption, a traditional classification algorithm applied to an imbalanced data set produces predictions that lean toward the majority class, while its classification accuracy on the minority class is low. To achieve better prediction results, classification performance can be improved at both the data level and the algorithm level. At the data level, sampling methods reduce the class imbalance in the original data set; at the algorithm level, ensemble learning algorithms replace simple classifiers for prediction.

The main ensemble algorithm used in this thesis is random forest (RF). Before random forest modeling, the data are resampled, which yields five models: Borderline-SMOTE1-RF, SMOTE-Tomek Link-RF, SMOTE-ENN-RF, RU-SMOTE-RF, and RU-ADASYN-RF. The empirical analysis uses the online shoppers purchase intention data set and the Portuguese banking institution direct marketing data set, with AUC and G-mean as evaluation metrics. A plain random forest model serves as the benchmark against which the classification performance of the five models is compared. The results show that RF models combined with mixed sampling achieve improved classification performance. Among them, RU-SMOTE-RF performs best, indicating that combining mixed sampling with random forest is an effective way to handle imbalanced data.
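To make the two evaluation metrics concrete, the following is a minimal sketch of how G-mean and AUC can be computed, using only the Python standard library. It is not the thesis's code: the function names `g_mean` and `auc` and the ten-sample toy labels, predictions, and scores are assumptions made for illustration. The toy data show why accuracy alone is misleading on imbalanced data: a classifier that always predicts the majority class scores 80% accuracy but a G-mean of zero.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (minority recall) and specificity.

    Labels are assumed to be 1 for the minority class and 0 for the majority.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the rank-based definition of ROC AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 8 majority (0) and 2 minority (1) samples.
y_true = [0] * 8 + [1] * 2

# A classifier that always predicts the majority class.
always_majority = [0] * 10

# A classifier that trades some majority accuracy for minority recall.
balanced = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Hypothetical predicted scores for the minority class.
scores = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.6, 0.7, 0.8, 0.35]


def accuracy(y, y_hat):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y, y_hat) if t == p) / len(y)


print("always-majority: accuracy", accuracy(y_true, always_majority),
      "G-mean", g_mean(y_true, always_majority))
print("balanced:        accuracy", accuracy(y_true, balanced),
      "G-mean", round(g_mean(y_true, balanced), 3))
print("AUC of scores:", auc(y_true, scores))
```

In this toy case the second classifier has lower accuracy (0.7 versus 0.8) but a much higher G-mean, which is the kind of trade-off that G-mean and AUC are chosen to expose when comparing resampled models against a baseline.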
Keywords: Imbalanced data, Mixed sampling, Ensemble learning, Random forest