Research On Credit Default Prediction Based On Machine Learning

Posted on: 2022-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Huang
GTID: 2518306614970539
Subject: FINANCE
Abstract/Summary:
With the vigorous development of the Internet industry, the traditional financial industry has begun to transform toward the financial Internet and Internet finance. Thanks to its low threshold, speed, convenience, and high yield, together with the advantages of Internet platforms, online credit has grown rapidly. However, high returns come with high risks. Because online credit lowers the threshold for lending, it has given rise to a series of problems, including illegal fund-raising, absconding with funds, and telecommunications fraud. How to prevent user fraud and control credit risk has become an urgent problem in recent years. Introducing machine learning algorithms to optimize credit risk control systems is therefore an effective way to promote the healthy development of the credit market.

There are two main difficulties in credit risk control modeling. First, real business data sets are often extremely imbalanced (that is, defaulting customers are very few and information on default samples is severely lacking), which makes model prediction difficult. Second, credit risk control systems require interpretable models (that is, when building a risk control model, we need to know how each feature x affects the prediction y so that the business department can carry out its work; for example, if "annual income" in a Sesame Credit score rises by 10,000 yuan, the score rises by 10 points). With its clear advantage in interpretability, the linear logistic regression (LR) model has become the most important choice for credit default prediction. However, the LR model has limited ability to learn nonlinear features, which lowers its prediction accuracy.

Therefore, based on a sample imbalance optimization method and machine learning algorithms, this paper studies credit default prediction in three stages. In the first stage, it reviews the relevant literature and analyzes the demand for and necessity of a credit default prediction model. In the second stage, it breaks down the optimization approach for default prediction models at two levels, data and model, and identifies the key factors in the optimization process. In the third stage, an empirical analysis takes the data for the four quarters of 2019 from the Lending Club platform (518,107 credit records and 150 feature variables in total) and compares results along two dimensions, classification algorithm and sampling algorithm, to verify the effectiveness of the sample imbalance optimization method and the SMOTETomek-LightGBM-LR model.

The innovation of this paper lies in two aspects: data and model. At the data level, in response to the extremely imbalanced sample distribution of credit data sets, this paper proposes a sample imbalance optimization method. First, the optimal sampling ratio for the sampling algorithm is searched according to the sample distribution of the data set. Then the sampling algorithm is used to expand the minority-class samples at that optimal ratio. Finally, cost-sensitive learning adjusts the sample weights to increase the misclassification cost of minority-class samples. At the model level, in response to the LR model's limited ability to learn nonlinear features, the SMOTETomek-LightGBM-LR credit default prediction model is designed. First, the sample imbalance optimization method based on the SMOTETomek algorithm is applied to improve data quality. Then the LightGBM algorithm is used to derive features: the path from the root node to each leaf node is used as a new feature and combined with the original features as the input of the LR model, which improves the LR model's ability to learn nonlinear features. Finally, the prediction accuracy of the different classification models is evaluated by AUC, KS value, and recall.

The experimental results show that: (1) The sample imbalance optimization method proposed in this paper improves AUC, KS value, and recall compared with the prediction results before optimization, which verifies its effectiveness. (2) Compared with the other models, the SMOTETomek-LightGBM-LR credit default prediction model designed in this paper achieves the highest AUC, KS value, and recall, although the margin is very small.

From the perspective of practical significance, this paper makes two main contributions. First, it alleviates, to a certain extent, the problem of extremely imbalanced data distribution in real credit approval business and offers financial institutions such as online loan platforms a new idea for improving their risk control systems. Second, it adopts a model fusion method to improve the accuracy of credit default prediction, which gives the credit industry a new way to reduce the default rate and improve the efficiency of capital utilization. Accordingly, this paper puts forward three suggestions for the future development of the online loan industry. First, integrate external credit data, broaden the sources of credit data, and build a credit investigation system that keeps pace with the times. Second, change the interest rate setting mechanism and appropriately lower the interest rate for users with low credit ratings to reduce the risk of default. Third, protect the rights and interests of investors and make skillful use of machine learning algorithms (such as model fusion) to improve the risk control system and promote the sustainable development of the industry.
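
The abstract evaluates the models by AUC, KS value, and recall without defining them. The following is a minimal sketch, in plain Python, of how these three measures are conventionally computed from predicted default probabilities. It is not the thesis's code: the function names, the toy labels and scores, and the 0.5 decision threshold are all illustrative assumptions, with label 1 standing for a defaulting customer.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points, sweeping the threshold from high to low score."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume every record that shares this score so ties yield one point.
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_value(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def ks_value(points: List[Tuple[float, float]]) -> float:
    """KS statistic: the largest gap between TPR and FPR over all thresholds."""
    return max(abs(tpr - fpr) for fpr, tpr in points)


def recall(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Share of actual defaulters whose score reaches the threshold."""
    n_pos = sum(1 for y in y_true if y == 1)
    hits = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    return hits / n_pos


# Toy data (hypothetical): 3 defaulters among 8 borrowers.
labels = [1, 0, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]
curve = roc_points(labels, probs)
print("AUC:", round(auc_value(curve), 4))
print("KS:", round(ks_value(curve), 4))
print("Recall at 0.5:", round(recall(labels, probs, 0.5), 4))
```

AUC and KS depend only on how the model ranks borrowers, while recall depends on the chosen threshold. This may be one reason to report all three together on an imbalanced data set.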
Keywords/Search Tags: online finance, default forecast, sample imbalance, logistic regression