Research On Personal Credit Default Prediction Based On XGBoost+RF

Posted on: 2021-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Wang
GTID: 2518306107979959
Subject: Applied Statistics
Abstract/Summary:
As China's economy shifts from high-speed growth to high-quality development, consumption is gradually replacing exports and investment as the new engine of economic growth. In this new stage, China's personal consumer credit market shows great potential and plays an increasingly important role in stimulating consumer demand, improving inclusive financial services, and supporting the sustainable development of the service economy. As the total volume of personal consumer credit expands and is expected to keep growing rapidly, pre-loan risk control becomes a very important issue. In recent years, the accumulation of massive, high-dimensional customer data has made it possible to use quantitative methods to predict a customer's credit status and behavior more accurately before a loan is granted.

This thesis uses machine learning methods to predict pre-loan credit default risk from the basic information of ordinary customers. The main research direction is to promote the application and improvement of machine learning methods in credit risk prediction. The related techniques and characteristics of bagging and boosting algorithms are analyzed first. XGBoost, as a gradient boosting algorithm, yields importance scores for each attribute relatively directly; random forest, because of the randomness it introduces, is resistant to overfitting and noise, fast to train, and easy to parallelize. On this basis, a method combining the XGBoost algorithm with the random forest algorithm, referred to as XGBoost+RF, is chosen as the main method for personal credit risk prediction.

The data samples come from Lending Club credit loan data for 2007-2015. Before model training, the data are preprocessed as follows. First, the original data are cleaned: features with a large proportion of missing values are deleted, the target value is converted to a numeric label, features are abstracted, special features are processed, derived features are constructed, and null values are imputed. Second, because the original data set is very large and its positive and negative samples are unbalanced, 140,000 records are randomly selected as the training set while keeping the original ratio of positive to negative samples, 1:13, and 1,000 records are selected from the remaining samples as the prediction set used to test the trained models in practical application. Finally, because the positive and negative samples in the training set are seriously unbalanced, the SMOTE+ENN method is used for resampling.

To show that the proposed XGBoost+RF algorithm has advantages over traditional machine learning methods in pre-loan personal credit risk prediction, model training is compared in two stages. The first stage compares XGBoost+RF with XGBoost and random forest used separately. The second stage compares XGBoost+RF with commonly used binary classification models such as decision tree, SVM, and logistic regression. Training performance is compared using accuracy, precision, recall, F-measure, the ROC curve, and the AUC value; practical performance is compared using the final prediction accuracy on the prediction set.

In the first stage, the XGBoost model has the best training performance, and the XGBoost+RF model trains about as well as the random forest model, but in actual prediction the accuracy of XGBoost+RF is much higher than that of either XGBoost or random forest. In the second stage, the decision tree algorithm has the best training performance and XGBoost+RF the second best, while logistic regression and SVM perform almost the same; in the final practical prediction, however, XGBoost+RF still has the highest accuracy. The thesis therefore concludes that the XGBoost+RF model has advantages in pre-loan personal credit risk prediction.
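
The evaluation indices named in the abstract (accuracy, precision, recall, F-measure, AUC) can be illustrated with a minimal, dependency-free Python sketch. It is not the thesis's code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and label 1 is assumed to mark a default.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating label 1 (assumed: default) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F-measure (F1) from hard 0/1 predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), which is
    fine for a toy example but not for a large data set.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): 2 defaults among 10 loans.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35, 0.05, 0.15, 0.25]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(classification_metrics(y_true, y_pred))
print(auc(y_true, scores))
```

On this toy data the accuracy is 0.8 while precision and recall are both only 0.5, which is why accuracy alone is a weak guide on unbalanced data: at a 1:13 class ratio, a model that always predicts the majority class already reaches 13/14, or about 0.93, accuracy.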
Keywords/Search Tags:data cleaning, evaluation index, xgboost, random forest, machine learning