| The growth of the consumer credit industry promotes economic development but also brings credit risk. Major banks and financial institutions hope to extract valuable information from massive volumes of customer data and use it to analyze customers' credit ratings, so as to avoid credit risk effectively. How to apply data mining technology to personal credit scoring models and improve their prediction performance has therefore become an important research direction. In this paper, the credit data set is obtained from the Data Castle big data competition platform. First, through exploratory data analysis, descriptive statistics are computed for each feature variable, the coverage rate and missing rate are calculated, and the quality of the data is given a preliminary check. Second, the problems related to missing values are reviewed, and three methods, K-nearest neighbor imputation, multivariate feature imputation, and random forest imputation, are selected to process the missing data in the original data set. Each imputed data set is split into training and test sets, a decision tree classifier is fitted on the training set and used to predict the test set, and the classification accuracy of the three methods is compared. The experiment shows that multivariate feature imputation is slightly better than the other two methods. Then, an improved Boruta feature selection algorithm is proposed, and its feasibility is verified using a data set from the UCI machine learning database and the decision tree algorithm. The improved method is applied to the credit data set and, combined with the WOE binning and IV results, is used to select the most suitable feature subset for modeling. Finally, the credit data set is divided into a 70% training set and a 30% test set. Personal credit score prediction models are built using logistic regression, the traditional credit scoring method, and the XGBoost algorithm, a data mining technique. Prediction performance is evaluated with metrics such as accuracy, recall, the ROC curve, and the AUC value. The AUC of logistic regression is 0.45, and the AUC of the XGBoost algorithm is 0.89. The experimental results show that the XGBoost model based on the improved Boruta feature selection algorithm has better prediction performance. |
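
The WOE binning and IV screening mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the feature has already been binned, that labels are coded 1 for a bad (defaulting) customer and 0 for a good one, and that WOE is taken as the log ratio of the good share to the bad share (some texts use the inverse ratio). The function name `woe_iv`, the smoothing count `eps`, and the toy data are all hypothetical.

```python
import math


def woe_iv(bins, labels, eps=0.5):
    """Return ({bin: WOE}, IV) for one already-binned feature.

    bins   -- bin identifier of each sample
    labels -- 1 for a bad customer, 0 for a good one (assumed coding)
    eps    -- count substituted for an empty cell to avoid log(0) (assumption)
    """
    counts = {}
    for b, y in zip(bins, labels):
        good_bad = counts.setdefault(b, [0, 0])
        good_bad[y] += 1  # index 0 counts good customers, index 1 counts bad

    total_good = sum(c[0] for c in counts.values())
    total_bad = sum(c[1] for c in counts.values())
    if total_good == 0 or total_bad == 0:
        raise ValueError("both classes must be present to compute WOE")

    woe, iv = {}, 0.0
    for b, (good, bad) in counts.items():
        good_share = (good or eps) / total_good
        bad_share = (bad or eps) / total_bad
        woe[b] = math.log(good_share / bad_share)
        # Each bin contributes (good share - bad share) * WOE to the IV.
        iv += (good_share - bad_share) * woe[b]
    return woe, iv


# Toy data: the "low" bin holds 3 good and 1 bad, the "high" bin 1 good and 3 bad.
toy_bins = ["low"] * 4 + ["high"] * 4
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
toy_woe, toy_iv = woe_iv(toy_bins, toy_labels)
print(toy_woe, toy_iv)
```

In a screening step of this kind, features whose IV falls below a chosen threshold would be dropped before modeling. The threshold itself is a design choice that this sketch does not fix.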