Research On P2P Online Loan Default Prediction Model Based On Integrated Classification Algorithm

Posted on:2020-11-12

Degree:Master

Type:Thesis

Country:China

Candidate:W H Li

Full Text:PDF

GTID:2438330596471085

Subject:Information management and information systems

Abstract/Summary:

P2P (Peer-to-Peer) online lending is an Internet-based lending model between individuals. As an important representative of Internet finance, it provides a convenient investment and financing channel for borrowers and investors. With the rapid development of the domestic online loan industry, P2P loan defaults occur frequently, which harms investors' rights and the normal operation of platforms. A default prediction model can identify high-risk borrowers and help investors on a P2P platform make more accurate decisions. However, because the data are complex and unevenly distributed, it is difficult to build an effective default prediction model for a P2P online lending platform. To address these problems, this paper builds the default prediction model with integrated (ensemble) classification algorithms, which offer high classification accuracy, resistance to over-fitting, strong generalization ability and suitability for complex data sets. For the large difference between the numbers of normal-repayment and default samples in the data set, that is, the problem of unbalanced data distribution, sampling techniques are used to balance the data and improve the classification performance of the model.

The specific work is as follows. First, the data are analyzed and processed: the data distribution and variable types are studied, and the important indicators of the risk prediction model are determined through data cleaning and feature engineering. Then the random forest algorithm and the LightGBM (Light Gradient Boosting Machine) algorithm are used to construct default prediction models, and their parameters are tuned. To handle the data imbalance, three methods are selected to balance the data: SMOTE (Synthetic Minority Oversampling Technique) oversampling, random undersampling, and the "SMOTE_TomekLinks" combined sampling method. The changes in classification performance before and after balancing are compared and analyzed. Finally, the default prediction models are used to analyze the importance of the default risk factors, and the 10 features that have an important influence on the two models are obtained.

The experiments are based on personal loan data from the Lending Club platform for 2012-2018. The results show that the accuracy, F1 score and AUC of both models improve after data balancing. The random forest algorithm achieves better classification performance under the combined sampling method than under the other sampling methods. The LightGBM algorithm, owing to its own algorithmic characteristics, achieves better classification performance under SMOTE oversampling than under the other sampling methods. The accuracy of both models is higher than 86%, which is much higher than the platform's average compliance rate of 79.83%. A comparison of the final models shows that the evaluation metrics of the random forest algorithm are slightly better than those of the LightGBM algorithm, but the random forest algorithm is less efficient. Features such as credit level, FICO score and bank account number are found to have a serious influence on user default. The model therefore not only improves prediction accuracy but also effectively addresses the problem of data imbalance. It can also help the platform screen high-quality borrowers and plays a positive role in controlling credit risk on P2P online lending platforms.
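
To make the balancing step concrete, the following is a minimal sketch of the ideas behind SMOTE oversampling and random undersampling, written in plain Python. It is not the thesis's implementation: the function names `smote` and `random_undersample`, the toy two-feature data, and the parameter values are all assumptions chosen for illustration. The combined "SMOTE_TomekLinks" method additionally removes Tomek links after oversampling, which is not shown here.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority samples (without replacement)."""
    return random.Random(seed).sample(majority, n_keep)


# Hypothetical toy data: two features per loan, 12 repaid vs 3 defaulted.
rng = random.Random(42)
repaid = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(12)]
defaulted = [[rng.gauss(3.0, 0.5), rng.gauss(3.0, 0.5)] for _ in range(3)]

# Oversampling grows the default class to the size of the repaid class;
# undersampling shrinks the repaid class to the size of the default class.
oversampled = defaulted + smote(defaulted, n_new=len(repaid) - len(defaulted), k=2)
undersampled = random_undersample(repaid, n_keep=len(defaulted))
```

Oversampling keeps every majority sample at the cost of adding synthetic minority points, while undersampling discards majority data; which trade-off works better can depend on the classifier, as the results above suggest.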

Keywords/Search Tags:

P2P Network Lending, Integrated Learning, Random Forest Algorithm, LightGBM Algorithm, Unbalanced Data

Related items

1	Based On Multi-Sensor Data Fusion Algorithm Of Integrated Monitoring System
2	Research On Quantitative Investment Strategy Based On Integrated Algorithm
3	Research On Extraction Method Of Industrial Control Network Security Situation Elements Based On Random Forest
4	Research And Application Of High Dimensional Imbalanced Data Classification Based On Random Forest
5	Research On Optimization Of Random Forest Algorithm And Its Application In Text Parallel Classification
6	Research On Intrusion Detection Technology Based On Random Forest Algorithm
7	The Research On Random Forest And Its Parallelization Oriented To Unbalanced High-dimensional Data
8	Application Research Of Unbalanced Data Classification Algorithm Based On Integrated Learning
9	Research On The Expansion And Classification Of Several Imbalanced Data Sets Based On C-SMOTE Algorithm
10	Research On Adaboost Improved Algorithm For Unbalanced Data