| With the rapid development of Internet banking in China, online lending, one of its lines of business, has begun to be accepted and used by the public. Because the online loan procedure differs from the traditional bank loan procedure, how to assess borrowers’ credit and risk is the core problem of the business. This core problem also includes tracking changes in borrowers’ economic circumstances and predicting whether they will repay. In view of the large data volume of the online loan business and the need to update the model iteratively, we design and implement a parallel computing platform based on random forest and Spark. To address the class imbalance of the data (the majority of records are normal and only a few are overdue), we propose an improved comprehensive sampling method to rebalance the data. In addition, to address shortcomings of the random forest algorithm, we propose a weighted random forest algorithm in which the performance of each decision tree is evaluated using its F1 value on out-of-bag (OOB) data. In summary, we propose a parallel weighted random forest algorithm based on Spark, tailored to the business requirements and data characteristics of online loan overdue prediction. Experiments show that the improved sampling method and the weighting scheme effectively improve prediction accuracy and reduce the occurrence of tied votes. Moreover, in terms of F1 the algorithm outperforms common classification algorithms such as SVM, logistic regression, C4.5, and the traditional random forest, and it shows good scalability and performance. |
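
The weighting idea in the abstract can be illustrated with a minimal single-machine sketch. It assumes that each tree's voting weight is simply its F1 value on its own OOB samples and that class 1 denotes an overdue loan. The function names `f1_score` and `weighted_vote` are hypothetical, and this is not the Spark implementation described by the authors.

```python
from collections import defaultdict


def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class (assumed here: 1 = overdue)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_vote(tree_predictions, tree_weights):
    """Return the label with the largest total weight across trees."""
    scores = defaultdict(float)
    for label, weight in zip(tree_predictions, tree_weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Illustrative values only: four trees split 2-2 on one sample, so an
# unweighted majority vote would tie. Weighting by OOB F1 breaks the tie.
predictions = [1, 1, 0, 0]
oob_f1_weights = [0.8, 0.6, 0.5, 0.4]
decision = weighted_vote(predictions, oob_f1_weights)
```

In this toy case the two trees voting for class 1 carry more total weight than the two voting for class 0, so the weighted vote resolves what would otherwise be a tie. This is one plausible reading of how weighting reduces tied votes.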