Research On Unbalanced Data Processing Method For Credit Scoring

Posted on: 2022-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: J H Zhao
GTID: 2518306731472554
Subject: Computer technology
Abstract/Summary:
With the rapid development of financial lending, credit risk is increasing, so a good credit scoring model plays an important role in reducing financial credit risk. Machine learning is currently the main approach to building credit scoring models. However, when existing classification models learn from class-imbalanced data, their predictions are biased toward the majority class, their accuracy on the minority class is low, and they are easily affected by noisy data, all of which hampers training. In practice, default samples in financial lending data sets are usually far fewer than non-default samples, so these data sets are imbalanced to varying degrees, and in credit scoring the prediction for minority-class samples matters more than the prediction for majority-class samples. To address data imbalance and the impact of noisy data on the classification model, and to improve the overall performance of the credit scoring model, the main research contents and innovations of this paper are as follows:

(1) To address class imbalance in financial lending data sets, this paper proposes an oversampling algorithm, TK-CTGAN, that combines Tomek-Link with the conditional GAN for tabular data (CTGAN). The algorithm first filters noise and boundary samples out of the data set, then uses CTGAN to learn from the filtered minority samples and generate synthetic samples that conform to the minority-class distribution, thereby expanding the minority class. This avoids introducing noise samples during minority oversampling, which would degrade classifier performance, and alleviates the class imbalance in the financial lending data set. Experiments show that TK-CTGAN handles imbalance better than the Synthetic Minority Oversampling Technique (SMOTE), with higher AUC and Recall. In particular, with the eXtreme Gradient Boosting (XGBoost) classification model, TK-CTGAN's Recall is 18.9% higher than SMOTE's and its AUC is also better, which shows that TK-CTGAN improves both the recognition of minority samples and overall classification performance.

(2) To build a better credit scoring model, this paper proposes an imbalanced ensemble model, TKEE-XGBoost: an ensemble classification model that uses XGBoost as the base classifier within the Easy Ensemble framework and is combined with TK-CTGAN, which applies noise filtering and minority-sample expansion to the training data set. The model repeatedly draws subsets of the majority class, pairs each with the same number of minority samples, and trains one classifier per subset for ensemble learning. Experiments show that the model outperforms the original Easy Ensemble algorithm in Accuracy, Recall and AUC. In particular, the Recall of TKEE-XGBoost is 7.5% higher than that of the traditional Easy Ensemble method, which shows that TKEE-XGBoost not only improves the overall classification performance of the credit scoring model but also has an advantage in identifying minority samples.

In financial lending, default samples are the minority class, so improving the ability to identify minority samples is particularly important. Experimental analysis and verification show that both algorithms proposed in this paper, TK-CTGAN and TKEE-XGBoost, significantly improve Recall. They can therefore identify default samples in financial lending more effectively, reducing investors' economic losses and financial credit risk.
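
The two data-handling steps named in the abstract, Tomek-Link filtering and Easy Ensemble style balanced sampling, can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation: the function names `tomek_links`, `drop_tomek_links` and `balanced_subsets` are hypothetical, Euclidean distance is assumed, and removing both members of each Tomek link is an assumption, since the abstract does not say which samples are dropped. The CTGAN generation step and the XGBoost training step are omitted.

```python
import math
import random


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours (Euclidean distance) and carry different labels."""
    def nearest(i):
        others = (j for j in range(len(X)) if j != i)
        return min(others, key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def drop_tomek_links(X, y):
    """Remove every sample that belongs to a Tomek link.
    Assumption: both members of each link are removed."""
    linked = {k for pair in tomek_links(X, y) for k in pair}
    keep = [i for i in range(len(X)) if i not in linked]
    return [X[i] for i in keep], [y[i] for i in keep]


def balanced_subsets(X, y, minority_label, n_subsets, seed=0):
    """Easy Ensemble style sampling: each subset holds all minority
    samples plus an equally sized random draw (without replacement)
    from the majority class. One classifier would be trained per subset."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    subsets = []
    for _ in range(n_subsets):
        picked = rng.sample(majority, len(minority)) + minority
        subsets.append(([X[i] for i in picked], [y[i] for i in picked]))
    return subsets
```

In a pipeline shaped like the one the abstract describes, the filtered minority samples would next be passed to a generative model to synthesize additional minority rows, and each balanced subset would train one base classifier whose predictions are then combined.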
Keywords/Search Tags: Generative Adversarial Network, Imbalanced data, Ensemble Learning, XGBoost, EasyEnsemble