
Research On Credit Default Risk Control Under Imbalanced Data

Posted on: 2022-06-09 Degree: Master Type: Thesis
Country: China Candidate: F K Chen Full Text: PDF
GTID: 2518306311966419 Subject: Probability theory and mathematical statistics
Abstract/Summary:
Internet finance has emerged with the development of big data. As the core and foundation of financial business, risk control has always been the focus of attention. Consumer finance can effectively improve the consumption level and ability of residents. The loan targets of consumer finance are characterized by large crowds, small amounts, short cycles and high complexity, which make it more difficult to accurately predict whether a borrower is likely to default. With the penetration of artificial intelligence technology, fintech can provide more accurate risk control services. Theoretically, credit default prediction is typically a binary classification problem under extremely imbalanced data. In addition to being high-dimensional and sparse, the data often contains a large number of noisy samples, making traditional mathematical statistics and machine learning methods ineffective at identifying users with potential default risk, which may cause greater financial losses and hinder the development of the country's economy. The commonly used methods for imbalanced data classification include data-level sampling methods and algorithm-level cost-sensitive learning. This paper therefore starts from both kinds of solutions and relies on advanced machine learning methods to predict the users at risk of default. The main work and contributions of this paper are as follows:

1. Based on data mining technology, data cleaning and data transformation are carried out on the borrower data, and a feature selection method combining variance filtering and mutual information selection is used to determine 36 borrower features for further research, which ensures the validity of the selected borrower features. Then, a credit default prediction model is built based on the LightGBM algorithm, and it is compared with the common machine learning algorithms KNN, support vector machine, random forest, etc. The experimental results show that the prediction model based on the LightGBM algorithm is significantly better than the other models in AUC, G-mean and Recall values.

2. At the data level, most existing algorithms fail to account at the same time for noisy samples when synthesizing minority-class samples and for the imbalanced distribution within the class. To address this, the oversampling algorithm Minority-Kmeans-SMOTE is proposed. First, noisy samples are identified based on the KNN algorithm. Then, the K-means algorithm is used to cluster the non-noisy minority-class samples into different clusters, and targeted oversampling is conducted according to the density of the minority class in each cluster, so as to alleviate the intra-class imbalance. When the dataset is extremely imbalanced, the hybrid sampling algorithm MKSE-LGBM is formed by combining the Minority-Kmeans-SMOTE algorithm with an undersampling algorithm based on the Easy Ensemble idea. The results on KEEL public datasets and a credit default dataset show that the MKSE-LGBM algorithm can effectively improve the AUC, G-mean and Recall values of the model.

3. At the algorithm level, the traditional cross-entropy loss function performs poorly on imbalanced data classification, so cost-sensitive learning is combined with the LightGBM algorithm by modifying the loss function. Specifically, the traditional cross-entropy loss function in the LightGBM model is modified into different forms of weighted loss function. The experiments show that the model's performance is improved. Further considering that samples differ in how difficult they are to classify, the Focal loss function from the field of object detection is introduced. The experimental results show that the Focal loss function can effectively improve the AUC, G-mean and Recall values of the model.
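
As an illustration of the Focal loss idea named in the third contribution, the following is a minimal sketch in Python. It is not the thesis's implementation: the function names are hypothetical, the formula is the standard binary Focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), and the defaults gamma = 2.0 and alpha = 0.25 are the values commonly quoted for object detection, assumed here only for demonstration. The gradient with respect to the raw score is included because gradient boosting with a custom loss needs the loss derivatives; a second derivative would also be needed in practice and is omitted here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _p_t_and_alpha_t(z, y, alpha):
    """Probability assigned to the true class, and that class's weight."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clip away from zero so that log(p_t) stays finite.
    return min(max(p_t, 1e-12), 1.0), alpha_t


def focal_loss(z, y, gamma=2.0, alpha=0.25):
    """Binary Focal loss for raw score z and label y in {0, 1}.

    gamma and alpha are assumed defaults, not values from the thesis.
    With gamma = 0 this reduces to class-weighted cross-entropy.
    """
    p_t, alpha_t = _p_t_and_alpha_t(z, y, alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(z, y, gamma=2.0, alpha=0.25):
    """Derivative of focal_loss with respect to the raw score z."""
    p_t, alpha_t = _p_t_and_alpha_t(z, y, alpha)
    sign = 1.0 if y == 1 else -1.0  # d p_t / d z = sign * p_t * (1 - p_t)
    return sign * alpha_t * (1.0 - p_t) ** gamma * (
        gamma * p_t * math.log(p_t) - (1.0 - p_t)
    )


if __name__ == "__main__":
    # Compare a well-classified positive (z = 3) with a misclassified one (z = -1).
    for z in (3.0, -1.0):
        print(z, focal_loss(z, 1), focal_loss(z, 1, gamma=0.0))
```

The factor (1 - p_t)^gamma shrinks the contribution of samples the model already classifies confidently, so training concentrates on hard samples, while alpha_t plays the role of the class weight in the weighted cross-entropy variant.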
Keywords/Search Tags: Credit Default Forecasting, Imbalanced Data, LightGBM, Hybrid Sampling, Focal Loss Function