
Research On Classification Prediction Model Based On Imbalanced Sampling

Posted on: 2021-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: G H Dai
Full Text: PDF
GTID: 2510306302974409
Subject: Applied Statistics
Abstract/Summary:
In recent years, with the continuous progress of machine-learning theory, machine-learning models have outperformed traditional classification models in many prediction scenarios and are steadily delivering practical value across industries, moving from academia into production. Classical machine-learning theory typically assumes that the training set contains roughly the same number of samples for each class, i.e. that the classes are balanced, but real-world problems rarely satisfy this assumption. Imbalanced samples cause the trained model to focus on the majority classes and to "dismiss" the minority classes, which degrades its generalization on test data. In most real classification tasks the class counts are almost never exactly equal, but a small difference has little impact; a pronounced class imbalance, however, is both common and expected. In fraudulent-transaction identification, for example, fraudulent transactions form only a small fraction of all transactions while the vast majority are normal, a very typical case of class imbalance. Some literature points out that once the imbalance ratio exceeds 4:1, the classifier can no longer meet the classification requirements because of the data imbalance, so the imbalance must be handled before a classification model is constructed. Imbalanced data makes traditional classification models perform poorly on the test set, and handling it properly is therefore the key to building an effective classification model. Many scholars have proposed remedies, the most common being undersampling, oversampling, sampling combination, and sampling ensembles. However, there is no settled conclusion about which of these methods is most conducive to building an effective classification model, and most of the literature does not examine how to choose among them.

This paper systematically explores the advantages and disadvantages of the various imbalance-handling methods on data with different degrees of imbalance, emphasizing generality by comparing results across multiple representative datasets. Eight datasets with typical imbalances are selected according to their imbalance ratios, ranging from extreme imbalance (about 1:129) to near balance (about 1:1.2). For imbalance handling, representative and commonly used algorithms are chosen from each family: Tomek Link Removal and Random Under-Sampling for undersampling, SMOTE for oversampling, SMOTE + Tomek Link Removal for sampling combination, and SMOTEBoost and RUSBoost for sampling ensembles. For classification models, the paper covers both ensemble and non-ensemble models with representative choices: logistic regression, random forest, and LightGBM. Each of the three models is trained on data processed by the seven sampling methods as well as on the unprocessed data, and the stable metrics AUC and F1 are used to compare performance.

According to the experiments designed in this paper, oversampling, or combination and ensemble methods based on oversampling, is better suited to extremely imbalanced data; when the imbalance ratio is greater than 1:19, undersampling or undersampling ensembles perform better; and when the imbalance ratio is close to 1:1, none of the sampling methods improves significantly on modeling the original data.
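To make the two basic sampling directions concrete, the following is a minimal pure-Python sketch (not the thesis's own code, and not the optimized implementations in libraries such as imbalanced-learn): random undersampling discards majority-class samples until the classes match, while SMOTE-style oversampling synthesizes new minority samples by interpolating between a minority point and a minority-class neighbour. Real SMOTE interpolates toward one of the k nearest neighbours; this sketch simplifies to the single nearest neighbour and assumes at least two minority samples.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly keep only as many majority-class samples as there are
    minority-class samples, yielding a 1:1 class ratio."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept, minority

def smote_like_oversample(minority, n_new, seed=0):
    """Create n_new synthetic minority samples by linear interpolation
    between a random minority point and its nearest minority neighbour
    (a simplified, single-neighbour variant of the SMOTE idea).
    Assumes minority contains at least two points (tuples of floats)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # nearest neighbour within the minority class, excluding a itself
        b = min((p for p in minority if p is not a),
                key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))
        t = rng.random()  # interpolation factor in [0, 1]
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return minority + synthetic
```

Because synthetic points lie on segments between existing minority points, oversampling enlarges the minority class without discarding majority information, which is consistent with the finding above that oversampling-based methods suit extreme imbalance, where undersampling would throw away too much data.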
Keywords/Search Tags:sample imbalance, undersampling, oversampling, machine learning