| With the country's vigorous implementation of inclusive finance policies, changing attitudes toward personal consumption, and the convenience of Internet-based lending, loans have become easier to obtain and loan demand has continued to rise. However, borrower credit risk, credit investigation risk, fraud risk, and other lending risks keep emerging, and in some new areas of loan business an effective model must be built with little or no data, which poses a major challenge to the business development of banks and other lending platforms. To improve the efficiency and security of lending platforms, it is necessary to assess the repayment ability of borrowers and to control their risk. Based on sample data, this thesis uses ensemble learning and transfer learning to predict loan default and applies the results to financial risk control. The experimental data are online credit data and personal loan data from the DataFountain website. First, descriptive statistics, visual analysis, and an analysis of the relationship between the independent and dependent variables provide a preliminary indication that loan interest rate, loan amount, debt-to-income ratio, loan grade, and number of early repayments have a significant influence on default. Next, the KS test, the t-test, and kernel density estimate (KDE) plots are used to measure the similarity of the variables in the two datasets in terms of their marginal distributions, showing that some variables are similarly distributed. Data preprocessing covers missing-value handling, outlier handling, and feature transformation. In feature engineering, six new variables are derived from business logic, and IV values, LightGBM feature importance, and correlation coefficients are used to eliminate unimportant variables 
and redundant variables, leaving 19 independent variables for building the loan default prediction model. For loan default prediction based on ensemble learning, XGBoost and LightGBM models are built on the personal loan data; class imbalance is handled with cost-sensitive weighting during modeling, and the optimal model parameters are found by grid search with five-fold cross-validation. Both models predict well on the test set, with AUC values of 0.854301 and 0.8548391, respectively. LightGBM also outperforms XGBoost on the personal loan test set on metrics such as recall and KS, and generalizes better. For loan default prediction based on transfer learning, two sample transfer methods, K-means++ and LGB-Filter, are used to transfer part of the online credit data into the personal loan data, after which XGBoost and LightGBM are used to build the loan default prediction model. Compared with ensemble learning on personal loan data alone, both sample transfer methods improve prediction to varying degrees. With K-means++ transfer, the AUC values of the XGBoost and LightGBM models were 0.858264 and 0.864808, respectively, relative increases of 0.46% and 1.17%. With LGB-Filter transfer, the AUC values of the XGBoost and LightGBM models were 0.859118 and 0.859437, respectively, relative increases of 0.56% and 0.54%. LGB-Filter produced the largest improvements in recall and G-mean and the largest reduction in BER: recall increased by 8.75% and 7.25%, G-mean increased by 3.12% and 2.47%, and BER decreased by 12.02% and 10.54%, respectively. Overall, the combination of LGB-Filter and LightGBM is the best loan default prediction model in this experiment, with the highest recall, G-mean, and KS and the lowest BER 
values among all models, and the AUC, F1 score, and accuracy are also good. Finally, the thesis summarizes its findings, notes the limitations of the research, and outlines directions for future work. |
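
For readers less familiar with the imbalance-oriented metrics named above, the following is a minimal sketch of how recall, G-mean, BER, and KS are commonly computed. It assumes the standard definitions (G-mean as the geometric mean of recall and specificity, BER as one minus their average, KS as the largest gap between true-positive and false-positive rates across score thresholds); the thesis's exact formulas are not given here, and the function names and toy data are purely illustrative.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 means default."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def imbalance_metrics(y_true, y_pred):
    """Recall, G-mean, and balanced error rate (BER) from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn)        # share of defaulters that are caught
    specificity = tn / (tn + fp)   # share of non-defaulters that are cleared
    return {
        "recall": recall,
        "g_mean": math.sqrt(recall * specificity),
        "ber": 1.0 - (recall + specificity) / 2.0,
    }


def ks_statistic(y_true, scores):
    """Largest |TPR - FPR| over all thresholds taken from the scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for thr in sorted(set(scores)):
        tpr = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr) / n_pos
        fpr = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr) / n_neg
        best = max(best, abs(tpr - fpr))
    return best


# Toy data (hypothetical): 4 defaulters and 6 non-defaulters.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is an assumption

metrics = imbalance_metrics(y_true, y_pred)
ks = ks_statistic(y_true, scores)
print(metrics, ks)
```

Because G-mean and BER weight the defaulting and non-defaulting classes equally, they are less flattering to a model that simply predicts the majority class than plain accuracy is, which is why they are useful alongside AUC on imbalanced loan data.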