
Research on User Loan Risk Prediction Using the Random Forest Algorithm on the Spark Platform

Posted on: 2019-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhou
Full Text: PDF
GTID: 2428330563453791
Subject: Computer application technology
Abstract/Summary:
In recent years, Internet finance and online credit in particular have grown rapidly, and their participants have become increasingly diverse. Online credit helps a growing number of users cover emergency and short-term funding needs. However, as lending platforms have operated longer and total loan volume has increased, default rates have begun to rise and the associated risk has gradually emerged. Studying user loan risk and evaluating it scientifically and rationally is therefore of great practical significance for small online credit enterprises seeking to guard against Internet finance risk.

In credit evaluation and loan risk prediction, user data have three important characteristics: an unbalanced class distribution, a large amount of noise, and high-dimensional features. Loan risk arises from the interaction of many user dimensions, which challenges traditional statistical methods that can only study the relationship between loan risk and a single feature or a small number of features.

To address the unbalanced distribution, this thesis uses feature engineering to extract as much information as possible from the original data for use by the algorithm and model. Through statistical calculation and the combination of cross features, it constructs 200 new feature dimensions that remain interpretable. Cross-validation is then used to prevent overfitting during model optimization.

To address noise and high dimensionality, the first step is data cleaning: removing duplicate and redundant records, filling in missing timestamps, handling outliers, and so on. Well-known, safe, and reliable online credit products are then surveyed, and the features they rely on are analyzed through a comprehensive evaluation and used as a reference for feature extraction. The XGBoost algorithm ranks the features by importance, yielding a lower-dimensional feature set with much of the noise removed. In the second step, the more effective features selected by the XGBoost model serve as the input features of the random forest. Because random forests are naturally parallel, the random forest model is built on the Spark parallel computing platform and analyzed further.

Finally, experiments are conducted on nearly 60 thousand loan user records from the Rong 360 platform, with grid search used to train the model. The effects of different parameter combinations are compared, each candidate model is evaluated through cross-validation, and the optimal model is selected. The performance of the final model is evaluated with good results, showing that predicting user loan risk with this model is feasible.
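The model-selection step described in the abstract (try every parameter combination in a grid, score each by cross-validation, keep the best) can be sketched without any external library. The following is only an illustration of that loop, not the thesis's implementation: the decision stump stands in for the random forest, the data are synthetic, accuracy is an assumed metric, and every function name and the `feature`/`n_bins` parameters are hypothetical. In Spark ML, the same roles are played by `ParamGridBuilder` and `CrossValidator`; the abstract does not state whether the thesis used them.

```python
import itertools
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_stump(X, y, feature, n_bins):
    """Stand-in model (hypothetical): pick the cut on one feature that best fits the training data."""
    candidates = [i / n_bins for i in range(1, n_bins)]

    def train_hits(threshold):
        return sum((row[feature] > threshold) == bool(label) for row, label in zip(X, y))

    return feature, max(candidates, key=train_hits)


def predict_stump(model, row):
    """Predict 1 when the chosen feature exceeds the learned threshold."""
    feature, threshold = model
    return int(row[feature] > threshold)


def cross_val_accuracy(X, y, params, k=5):
    """Mean held-out accuracy over k folds for one parameter combination."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for held_out in range(k):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        test = folds[held_out]
        model = fit_stump([X[j] for j in train], [y[j] for j in train], **params)
        hits = sum(predict_stump(model, X[j]) == y[j] for j in test)
        scores.append(hits / len(test))
    return mean(scores)


def grid_search(X, y, grid, k=5):
    """Evaluate every combination in the grid and return the best (params, score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(X, y, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic data (an assumption for illustration): the label depends only on feature 0.
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.3) for row in X]

grid = {"feature": [0, 1], "n_bins": [2, 4, 10]}
best_params, best_score = grid_search(X, y, grid)
print(best_params, best_score)
```

Because each candidate is scored only on folds it was not trained on, a parameter combination that merely memorizes the training data gains no advantage, which is how cross-validation guards against overfitting during tuning. For unbalanced data such as loan defaults, a metric like AUC would usually be more informative than the accuracy used here.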
Keywords/Search Tags:Loan Risk, Spark, Random Forest