Research On P2P Credit Evaluation Method Based On Machine Learning

Posted on: 2019-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Gao
Full Text: PDF
GTID: 2429330566477580
Subject: Applied Statistics
Abstract/Summary:
This thesis uses real data from a credit company. The data contain 31 variables: customer type is the dependent variable and the remaining variables are independent variables. Logistic regression, random forest, and support vector machine models are fitted to these data, and their ability to identify bad customers is compared. The models are validated with 10-fold cross-validation. Because the data contain 6664 good customers and only 330 bad customers, they are imbalanced, so the thesis applies the SMOTE algorithm for imbalanced data: the minority class (bad customers) is oversampled and the majority class (good customers) is resampled to obtain a new data set. The models' ability to identify bad customers is then compared on the original and the new data sets.

The 10-fold cross-validation results are as follows. (1) On the original data set, the random forest has the smallest error rate of the three models, 0.042. The random forest and the support vector machine have the highest true positive rate, both reaching 1.000. Logistic regression has the highest true negative rate, 0.56, which suggests that it identifies bad customers better than the other two models. (2) On the new data set, the random forest is best on all three measures, with an error rate of 0.057, a true positive rate of 0.987, and a true negative rate of 0.870, so the random forest trained after SMOTE performs best. Compared with the original data set, its error rate rises by 0.015 and its true positive rate falls slightly (from 1.000 to 0.987), but its true negative rate rises by 0.762, which suggests that SMOTE greatly improves the random forest's ability to identify bad customers.

Finally, the thesis proposes an improved SMOTE algorithm based on logistic regression. The interpolation step of SMOTE is weighted by the probability that the logistic regression model assigns to each sample, and the resulting data are used to train the random forest model. With the improved algorithm: (1) the error rate is 0.013 lower than with the original SMOTE algorithm and 0.002 higher than on the original data; (2) the true positive rate is lower than with the original SMOTE algorithm and on the original data, but only slightly, by 0.013 and 0.02 respectively; (3) the true negative rate is higher than with the original SMOTE algorithm and on the original data, by 0.047 and 0.804 respectively. The improved SMOTE algorithm thus gains a substantial increase in true negative rate at the cost of a slightly worse error rate and true positive rate, which benefits a credit business whose main concern is identifying risk (bad customers).
Keywords/Search Tags: random forest, support vector machine, logistic regression, SMOTE algorithm