Prediction Of Credit Overdue Behavior Based On Data Mining Algorithm

Posted on: 2021-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: W Ma
Full Text: PDF
GTID: 2428330620963497
Subject: Applied Statistics
Abstract/Summary:
With the development of China's economy and changing attitudes toward consumption, demand for credit has grown steadily, and the credit business has become a new source of profit for banks. A bank that fails to review and evaluate a client's qualifications and repayment ability effectively takes on greater financial risk. There is therefore an urgent need for efficient and accurate methods to manage risk and identify customers who are likely to become overdue, so as to inform the construction of a bank's credit system and its evaluation of customers.

This thesis uses the historical loan data of a lending institution as an example to build a model that predicts credit overdue behavior. First, the data are cleaned and preprocessed. Next, informative features are selected through weight of evidence (WOE) binning and information value (IV), and correlation coefficients are computed so that strongly correlated variables can be removed before they distort the experimental results. The loan data are extremely imbalanced, which causes a single classifier to fail completely and leaves even ensemble models with unsatisfactory results. To address this, the training set is balanced with a combination of random undersampling and SMOTE oversampling, which avoids both the excessive data loss of undersampling alone and the extra noise introduced by oversampling alone. Four models are then trained separately on the balanced training set: logistic regression, a support vector machine, and two decision-tree-based ensemble algorithms, random forest and LightGBM. Each model makes predictions on the original test set to produce a confusion matrix, and is evaluated by prediction accuracy and AUC.

The study found that training on the balanced data set significantly improved the classification performance of every algorithm, although large differences remained among them. For the single models, the failure problem is resolved and a usable level of classification is achieved; logistic regression outperforms the support vector machine, but the prediction accuracy of the single models is only moderate. Compared with the single models, the indicators of the ensemble models are greatly improved. The random forest achieves good results, and LightGBM retains the prediction accuracy of gradient boosting while training faster. Across the evaluation indicators, LightGBM classifies better overall than the random forest. It also runs quickly, uses little memory, and gives good classification results after parameter fine-tuning. LightGBM therefore performs best and is the more suitable choice for banks or lenders building a model to identify overdue users. Finally, an analysis of feature importance shows that transaction-information indicators contribute more information to classification, which offers guidance to banks, lending institutions, and credit reporting systems on collecting customer information more effectively.
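
The abstract names the balancing step (random undersampling combined with SMOTE oversampling) without giving its details, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It uses plain Python so that the interpolation at the heart of SMOTE is visible. The function names, the neighbour count `k`, the `target_per_class` size, and the toy data are all assumptions made for illustration. In practice, a library such as imbalanced-learn offers tested implementations of both steps.

```python
import math
import random


def random_undersample(rows, n_keep, rng):
    """Keep a random subset of at most n_keep rows (applied to the majority class)."""
    return rng.sample(rows, min(n_keep, len(rows)))


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority points.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    if n_new == 0:
        return []
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # All other minority points, sorted by Euclidean distance to `base`.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def balance(majority, minority, target_per_class, k=5, seed=0):
    """Shrink the majority class and grow the minority class to the same size.

    `target_per_class` is a hypothetical knob: a value between the two class
    sizes trades data loss (undersampling) against synthetic noise (SMOTE).
    """
    rng = random.Random(seed)
    kept_majority = random_undersample(majority, target_per_class, rng)
    n_new = max(0, target_per_class - len(minority))
    grown_minority = list(minority) + smote(minority, n_new, k, rng)
    return kept_majority, grown_minority


if __name__ == "__main__":
    # Toy two-feature data, invented for this example only.
    demo_rng = random.Random(42)
    on_time = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(1000)]
    overdue = [(demo_rng.gauss(3, 1), demo_rng.gauss(3, 1)) for _ in range(20)]
    kept, grown = balance(on_time, overdue, target_per_class=100)
    print(len(kept), len(grown))
```

Only the training set would be passed through `balance`. As the abstract states, predictions are made on the original, unbalanced test set, so the reported metrics reflect the real class distribution.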
Keywords/Search Tags: Credit overdue, Logistic regression, Support vector machine, Random forest, LightGBM