Credit Risk Assessment Based On Improved Random Forest

Posted on:2020-06-26

Degree:Master

Type:Thesis

Country:China

Candidate:Q J Ma

Full Text:PDF

GTID:2428330578956102

Subject:Internet of Things technology and applications

Abstract/Summary:

As the country's international standing has grown in recent years, all industries have developed steadily and rapidly, and the domestic economic environment has been favorable. The banking industry nevertheless faces a major challenge. Credit business is a powerful tool for commercial banks to open up the personal consumption market and to develop high-quality potential customers. Individual needs and personal credit models are becoming increasingly diverse, but however the models evolve, credit risk management remains the key. Given the high frequency of personal credit transactions and the large volume of customer information, data mining and analysis technology plays an increasingly important role in credit risk management. This thesis develops a credit default risk assessment system based on ensemble learning models, applying data mining theory and techniques for credit default risk management to real credit data. Through feature analysis of the credit data and a default prediction model, the work contributes to improving the credit risk management capabilities of commercial banks.

Credit default data is typically imbalanced, so the data must be balanced before modeling to reduce the impact of the imbalance on the prediction model. First, a boundary-sample pruning method removes the majority-class sample in each Tomek link pair. Next, a Gaussian mixture model partitions the minority-class samples into regions of the feature space, and new minority samples are generated only from minority samples within the same sub-region, which reduces the overlap of new samples caused by random sampling in the SMOTE algorithm. Finally, oversampling and undersampling are combined, and a balance threshold is adjusted to reach the desired class balance. Experiments comparing the predictions and errors of the ensemble models before and after balancing show that, although prediction accuracy fluctuates only slightly, the minority-class misclassification rate improves markedly, with the largest reduction being at least 28.5%.

Credit risk managers value the reasons behind a result more than a high-precision prediction alone. This thesis uses the random forest variable importance measure to score and rank each feature, and cross-analyzes the features by score interval to show credit risk managers what the attributes behind the data mean and how they bear on the stability of the model. In the experiment, the feature importance scores are first divided into four segments in descending order. Credit grade and number of overdue payments score highest; number of family members and current residence score lowest. Random forest models are then built while adding and removing features from the different score segments. The results show that variables scoring below 0.5 have little impact on the model, so they need not serve as key references in the lending process, whereas features in every segment scoring above 0.5 must be strictly checked during credit review.

A standard random forest combines all of its trees non-selectively and decides the final result by simple majority vote, which ignores the differences in strength between the decision trees and lowers prediction accuracy. To address this shortcoming, this thesis uses a similarity measure to cluster and select decision trees, and outputs the result through a dynamic weighted voting fusion method at the final voting stage, which improves the accuracy and stability of the random forest model to some extent. In a comparison with five credit default prediction models, the CM-RF model reached an average prediction accuracy of 86.34%, second only to the two SVM hybrid models, and achieved the best minority-class misclassification rate, only 9.29%. Finally, in the ROC curve comparison the model's AUC is 0.8839, indicating the highest stability among the compared models.
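
The Tomek link pruning step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes Euclidean distance, a brute-force nearest-neighbor search, and hypothetical toy data in which label 0 is the majority (non-default) class and label 1 the minority (default) class. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor; the sketch drops the majority-class member of each such pair.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i] (Euclidean), excluding i itself."""
    best_j, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def tomek_links(points, labels):
    """Pairs (i, j), i < j, of opposite-class samples that are mutual nearest neighbors."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if j is not None and i < j and nn[j] == i and labels[i] != labels[j]
    ]


def prune_majority(points, labels, majority_label):
    """Remove the majority-class member of every Tomek link."""
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: (5, 5) is a majority sample sitting on the class boundary.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5.2, 5.1), (9, 9)]
labels = [0, 0, 0, 0, 1, 1]

kept_points, kept_labels = prune_majority(points, labels, majority_label=0)
print(tomek_links(points, labels))  # [(3, 4)]
print(kept_labels)                  # [0, 0, 0, 1, 1]
```

Only the boundary majority sample is removed here; the minority samples are left untouched, which is what makes this step suitable as a cleaning pass before any oversampling. For real data sets, a library implementation with an indexed neighbor search would be preferable to this quadratic loop.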

Keywords/Search Tags:

Commercial Bank, Credit Default, Imbalanced Data, Random Forest

Related items

1	Research On Credit Default Risk Control Under Imbalanced Data
2	Research On Early Warning Of Credit Debt Default Risk Of Listed Companies
3	Research On Credit Default Identification Method Based On Deep Learning
4	Research On Credit Card Fraud Detection Based On Random Forest
5	Research For Imbalanced Big Data Classification Algorithm On Random Forest
6	Rule Extraction For Imbalanced Data Classification Based On SVM And Its Application In Commercial Bank Failures Prediction
7	Research On Imbalanced Data Classification Method Based On Random Forest Algorithm
8	Class-Imbalanced Data Stream Classification Method Based On Adaptive Random Forest
9	Research On Forecasting Accuracy Of Bank Credit Card Customer Default Probability Based On Data Mining Technology
10	Research On Default Risk Identification Of Online Loan Based On Machine Learning Hybrid Model