Personal Credit Evaluation Analysis Based On Multi Dataset Fusion

Posted on: 2021-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: P M Qi
GTID: 2506306245481734
Subject: Master of Applied Statistics
Abstract/Summary:
In recent years, as personal credit systems have continued to improve, personal credit evaluation has run into a number of problems, such as differing data sources, complex data structures, and large data scale. With traditional credit technology alone, it is difficult for the financial industry to overcome these data-level problems and the technical bottleneck at the evaluation-model level. The development of big data has raised the standing of risk control in enterprise operations. When data come from different sources, evaluating and analyzing personal credit helps banks reduce their non-performing loan rate and serves as a useful aid to risk evaluation in banks' refined operations.

This thesis starts from two questions in current personal credit evaluation, namely how to integrate multiple training sets and whether traditional evaluation techniques remain applicable, and analyzes the advantages and disadvantages of the traditional credit evaluation model. Credit evaluation is carried out with the GBDT/XGBoost+LR model, the Stacking algorithm, and the TrAdaBoost algorithm. An empirical analysis is conducted on the data set from the open competition of the Qianhai credit investigation company on kesci.com, and the results are compared with those of the traditional LR model.

First, missing values, outliers, and skewed variables in the data are preprocessed. The SMOTE algorithm is used to balance the positive and negative samples of the data set. The Pearson correlation coefficient and the Random Forest algorithm are used to reduce the dimensionality of the data. According to the variable importance ranking, the top 45 features with the greatest impact on medium credit loan business A and small short-term loan business B are selected. The LR model, GBDT + LR model, XGBoost + LR model, Stacking algorithm, and TrAdaBoost algorithm were implemented in Python 3, and their classification results were compared and analyzed. Five-fold cross-validation and grid search were used to optimize the parameters and find the best classifier. Accuracy, recall, and F1 value are used as auxiliary evaluation metrics, with the AUC value as the main evaluation metric.

Compared with the traditional LR model, the results show that the Stacking algorithm performs best among the four models and computes quickly, with F1 and AUC values as high as 0.89 and 0.87, respectively. The GBDT + LR and XGBoost + LR models come next, with AUC values of 0.84 and 0.82, respectively. In terms of feature importance, the GBDT + LR model favors features describing the customer's personal information, while the XGBoost + LR model favors features describing the customer's historical product purchases. Several models share some high-contribution features, such as UserInfo40, UserInfo50, UserInfo254, and ProductInfo31. Compared with the other three methods, the classifier learned by the TrAdaBoost algorithm performs slightly worse, with an F1 value of 0.75 and an AUC value of 0.74, and the TrAdaBoost algorithm also has the longest running time.

Finally, the best-performing Stacking algorithm is selected to predict a given test sample, and a key business analysis is carried out based on the relative importance scores of the variables obtained during model training. Innovating and improving credit evaluation models for multiple data sets is of great practical significance for enterprises seeking to avoid risk, and is worth further exploration and study.
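The evaluation metrics named in the abstract (accuracy, recall, F1 value, and AUC) can be illustrated with a minimal pure-Python sketch. This is not the thesis's code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the AUC is computed with the standard pairwise-ranking definition (the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half).

```python
def accuracy_recall_f1(y_true, y_pred):
    """Return (accuracy, recall, F1) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, f1


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = default, 0 = no default.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc, rec, f1 = accuracy_recall_f1(y_true, y_pred)
roc_auc = auc(y_true, scores)
```

Unlike accuracy, recall, and F1, the AUC does not depend on the chosen threshold, which is one common reason for treating it as the main metric on imbalanced credit data.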
Keywords/Search Tags: credit evaluation, multiple data sets, GBDT/XGBoost and LR fusion model, Stacking algorithm, TrAdaBoost algorithm
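For readers unfamiliar with the TrAdaBoost algorithm listed above, the following is a minimal sketch of one instance-reweighting round. It follows the commonly cited formulation of TrAdaBoost, not necessarily the thesis's implementation. The function name, the argument layout (source-domain instances first, then target-domain instances), and the clamping of the error rate are assumptions made for illustration.

```python
import math


def tradaboost_update(weights, errors, n_source, n_rounds):
    """Run one TrAdaBoost reweighting round (illustrative sketch).

    weights  : current instance weights, source instances first
    errors   : per-instance loss |h(x_i) - y_i|, each in [0, 1]
    n_source : number of source-domain instances (assumed layout)
    n_rounds : total number of boosting rounds planned
    Returns (new_weights, beta_t).
    """
    total = sum(weights)
    p = [w / total for w in weights]

    # Weighted error of the current learner on the target domain only.
    target_mass = sum(p[n_source:])
    eps = sum(p[i] * errors[i] for i in range(n_source, len(p))) / target_mass
    # Assumption: clamp eps so that beta_t stays in (0, 1).
    eps = min(max(eps, 1e-10), 0.499)

    beta_t = eps / (1.0 - eps)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))

    new_weights = []
    for i, w in enumerate(weights):
        if i < n_source:
            # Misclassified source instances are down-weighted.
            new_weights.append(w * beta ** errors[i])
        else:
            # Misclassified target instances are up-weighted.
            new_weights.append(w * beta_t ** (-errors[i]))
    return new_weights, beta_t


# Toy round (hypothetical): 2 source and 3 target instances.
new_w, beta_t = tradaboost_update(
    weights=[1.0, 1.0, 1.0, 1.0, 1.0],
    errors=[1, 0, 1, 0, 0],
    n_source=2,
    n_rounds=10,
)
```

The asymmetry is the point of the method: source-domain samples that disagree with the target task gradually lose influence, while hard target-domain samples gain it, as in ordinary AdaBoost.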