
Analysis And Research On Unbalanced Data Of Credit Score Based On Stacking Integrated Algorithm

Posted on: 2021-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
Full Text: PDF
GTID: 2428330611970414
Subject: Engineering
Abstract/Summary:
In recent years, with the continuous expansion of China's personal consumer credit business, risk control in the financial sector has become crucial. Banks use credit scoring systems to evaluate and predict borrowers' repayment ability and personal creditworthiness. However, customers who become overdue after receiving a loan are a minority: in the data set a bank uses to build a credit scoring model, the non-overdue samples (positive samples) far outnumber the overdue samples (negative samples). Such a data set is called an unbalanced data set. A credit scoring model built on an unbalanced data set is biased towards the majority class (non-overdue customers), so minority-class samples are easily misclassified and the recognition rate for the minority class (overdue customers) is low.

To address this imbalance, the SMOTE oversampling algorithm is used to balance the data set. SMOTE generates new minority-class samples from the existing minority-class samples (overdue customers) in the data set. However, the newly generated samples may blur the classification boundary between positive and negative samples and degrade the model's classification performance. To solve this problem, the MODIFIED-SMOTE oversampling algorithm is proposed. First, the 15% of minority-class samples closest to the classification boundary are removed, and new minority-class samples are then generated. Each time a new sample is generated, the KNN algorithm is used to determine whether it belongs to the minority class; if it does, the sample is retained, otherwise it is discarded. This more effectively prevents newly generated samples from blurring the classification boundary and avoids generating erroneous samples.

From the perspective of the model, this paper proposes the SLRA-Stacking (MODIFIED-SMOTE Logistic Random Forest Adaboost Stacking) model for credit scoring. First, the SLRA-Stacking model combines the MODIFIED-SMOTE oversampling algorithm with the Stacking integrated (ensemble) algorithm, which makes it better suited to the unbalanced nature of credit scoring data sets. Second, to improve the performance of the integrated model, the advantages and disadvantages of each single model are considered comprehensively: diversity among the base classifiers is achieved by combining different classification models, and the probabilities predicted by the models are combined with the original modeling attribute variables for secondary learning to achieve stronger generalization ability. In this paper, five models are trained, namely Logistic, Random Forest, Adaboost, Stacking and SLRA-Stacking, and the effect of each model before and after balancing the data set is compared. The conclusions are as follows: models trained on unbalanced data sets classify overdue customers less effectively than models trained on balanced data sets, and among the models trained on balanced data sets, SLRA-Stacking performs better on the test set than the other models, is stable, and has strong generalization ability. Therefore, SLRA-Stacking can meet the needs of banks' personal credit scoring and has practical value.
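The three MODIFIED-SMOTE steps described in the abstract (drop the minority samples nearest the boundary, interpolate new samples, keep only those that KNN assigns to the minority class) can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation. The function name `modified_smote`, its parameters, the use of "distance to the nearest majority sample" as a stand-in for closeness to the classification boundary, and the choice to run the KNN vote over the original labelled data are all assumptions made for illustration.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(point, candidates, k):
    """Return up to k candidates closest to `point`, excluding `point` itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: euclid(point, c))[:k]


def modified_smote(minority, majority, n_new, drop_frac=0.15, k=5, seed=0):
    """Hypothetical sketch of the procedure outlined in the abstract."""
    rng = random.Random(seed)

    # Step 1: drop the minority samples closest to the class boundary.
    # ASSUMPTION: boundary closeness is approximated by the distance
    # to the nearest majority sample.
    by_margin = sorted(minority, key=lambda p: min(euclid(p, q) for q in majority))
    core = by_margin[int(drop_frac * len(minority)):]

    # ASSUMPTION: the KNN check votes over the original labelled data
    # (1 = minority / overdue, 0 = majority / non-overdue).
    labelled = [(p, 1) for p in minority] + [(q, 0) for q in majority]

    synthetic = []
    attempts = 0
    while len(synthetic) < n_new and attempts < 50 * n_new:
        attempts += 1
        # Step 2: standard SMOTE interpolation between a remaining minority
        # sample and one of its k nearest minority neighbours.
        base = rng.choice(core)
        neighbour = rng.choice(k_nearest(base, core, k))
        gap = rng.random()
        candidate = tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

        # Step 3: keep the candidate only if a majority of its k nearest
        # labelled neighbours are minority samples; otherwise discard it.
        nearest = sorted(labelled, key=lambda item: euclid(candidate, item[0]))[:k]
        if 2 * sum(label for _, label in nearest) > k:
            synthetic.append(candidate)
    return synthetic


# Toy 2-D data standing in for credit records (purely synthetic).
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
minority = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(20)]

new_points = modified_smote(minority, majority, n_new=80)
print(f"generated {len(new_points)} synthetic minority samples")
```

The attempt cap means the function may return fewer than `n_new` samples when most candidates fail the KNN check, which is the intended effect of the filter: candidates that land in majority-dominated regions are never added to the training set.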
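The stacking step described in the abstract, where base-model probabilities are combined with the original attribute variables for secondary learning, can be sketched with scikit-learn's `StackingClassifier`, whose `passthrough=True` option passes the original features to the second-level learner alongside the base-model predictions. This is only an illustration under stated assumptions: the abstract does not name the second-level learner (logistic regression is assumed here), the hyperparameters are arbitrary, the data are synthetic, and the MODIFIED-SMOTE balancing that SLRA-Stacking applies to the training set beforehand is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a credit data set: roughly 10% "overdue" (label 1).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# Base classifiers named in the abstract; hyperparameters are illustrative.
base_learners = [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("adaboost", AdaBoostClassifier(n_estimators=50, random_state=0)),
]

stack = StackingClassifier(
    estimators=base_learners,
    # ASSUMPTION: the second-level learner is not specified in the abstract.
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # feed predicted probabilities upward
    passthrough=True,              # also feed the original features upward
    cv=5,                          # out-of-fold predictions for the meta-learner
)
stack.fit(X_train, y_train)

# Recall on the minority (overdue) class, the quantity the abstract focuses on.
overdue_recall = recall_score(y_test, stack.predict(X_test))
print(f"recall on the overdue class: {overdue_recall:.3f}")
```

In a pipeline following the abstract, only the training split would be oversampled before `fit`, so that the test split keeps the original class ratio.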
Keywords/Search Tags: Personal credit score, MODIFIED-SMOTE, Adaboost, SLRA-Stacking