Repayment probability prediction analyzes the repayment risk of borrowers and predicts their repayment probability by integrating basic information, transaction information and other data. With the rapid development of the small loan industry, the number of borrowers has surged and the bad debt rate has risen significantly, causing large economic losses. Traditional collection methods can no longer cope with the growing volume of overdue data or with complex and changeable business scenarios. Taking the overdue data of small loans as its research object and aiming to improve collection efficiency, this paper studies the repayment probability prediction model.

Common collection scoring models, such as the default probability model and the loss severity model, use data mining and statistical methods to analyze collection data, but they suffer from low accuracy and cannot handle large volumes of data because the underlying methods are too simple. Mainstream credit risk evaluation methods at home and abroad mostly use machine learning algorithms to construct risk prediction models; other methods use feature engineering to improve model accuracy, but they depend on borrowers' credit investigation information. This paper draws on both the data processing and the model building techniques of the above methods, applying machine learning algorithms to repayment risk analysis on overdue data. It also proposes a method of building the dataset based on collection features and designs a repayment probability prediction model based on attenuation weight, as follows.

(1) A method of building the dataset based on collection features. This method needs no credit investigation information and relies only on the original data of overdue customers. First, the high-dimensional data are processed: data items are filtered from the original data and a word2vec-based word vector model is extracted from the recording data; a keyword network is then constructed and visualized to extract keywords, which guides the selection of data items. Second, feature transformation rules are designed to extract features according to the type of each data item, converting qualitative data, customer addresses, dates, mobile phone numbers and other non-numerical data into numerical features or feature encodings. Third, the z-score method is used to normalize the data. Fourth, a label extraction algorithm generates the labels of the dataset samples while sorting out the repayment data; by adjusting a threshold it produces different labelings, which makes the sample labels flexible and reflects customers' repayment intention more objectively (a minimal sketch of this step is given below).

(2) A repayment probability prediction model based on attenuation weight. First, an attenuation weight is introduced for each sample, calculated with a temperature-time attenuation function that follows Newton's cooling law. Second, the XGBoost model is trained after adding the attenuation weights and tuning its parameters. Third, feature importance analysis is conducted on the trained model. Fourth, feature selection is performed to eliminate invalid and redundant features, and the training is repeated until the model meets the requirements.
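The label extraction step in (1) is only outlined in this abstract. The Python sketch below shows one plausible reading, in which a sample is labeled positive when the repaid share of the overdue amount reaches an adjustable threshold, followed by z-score normalization of the numerical features. The column names (`repaid_amount`, `overdue_amount`) and the exact label rule are illustrative assumptions, not the thesis's definition.

```python
import pandas as pd

def extract_labels(df: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Assumed label rule: a customer counts as willing to repay (label 1)
    when the repaid share of the overdue amount reaches `threshold`.
    Adjusting `threshold` yields different labelings, as described above."""
    repaid_ratio = df["repaid_amount"] / df["overdue_amount"].clip(lower=1e-9)
    return (repaid_ratio >= threshold).astype(int)

def zscore_normalize(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """z-score normalization: (x - mean) / std for every numerical feature."""
    out = df.copy()
    for col in numeric_cols:
        std = out[col].std()
        out[col] = (out[col] - out[col].mean()) / (std if std > 0 else 1.0)
    return out
```

The attenuation weight in (2) is described as following Newton's cooling law, i.e. an exponential decay of a sample's influence with its age. A minimal sketch under that assumption is shown below; the decay rate `k`, the use of days as the time unit, and the XGBoost hyperparameters are placeholders rather than the tuned values from the thesis.

```python
import numpy as np
import xgboost as xgb

def attenuation_weight(age_days: np.ndarray, k: float = 0.02) -> np.ndarray:
    """Newton's cooling law, T(t) = T_env + (T_0 - T_env) * exp(-k * t), reduced to
    a relative weight: recent samples keep weight ~1, older samples decay toward 0."""
    return np.exp(-k * age_days)

def train_weighted_xgb(X, y, age_days):
    """Train XGBoost with per-sample attenuation weights instead of the default 1.0."""
    weights = attenuation_weight(age_days)
    dtrain = xgb.DMatrix(X, label=y, weight=weights)
    params = {"objective": "binary:logistic", "eval_metric": "auc",
              "max_depth": 6, "eta": 0.1}  # placeholder hyperparameters
    booster = xgb.train(params, dtrain, num_boost_round=200)
    # gain-based feature importance feeds the later feature selection round
    return booster, booster.get_score(importance_type="gain")
```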
This paper constructs a dataset containing 144561 samples from five months of overdue customers' data. To validate the usability of the dataset, four mature machine learning methods are applied: logistic regression, GBDT, random forest and XGBoost. The experiments show that the dataset yields stable performance across these methods, with all AUC values slightly above 0.73. For the optimized XGBoost-based repayment probability prediction model, several experiments are designed for testing and analysis. In three groups of comparison experiments, the model trained with attenuation weights is compared with the one trained with the default weights, and the attenuation weight shows a significant advantage; the AUC on the test set in comparison experiment 3 rises to 0.7021. We then test the product feature with and without mean encoding. The results show that mean encoding raises the AUC to 0.7036 only when the distribution of this feature is consistent between the training set and the test set; once the distributions are inconsistent, one-hot encoding is more stable and gives higher results than mean encoding.
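The comparison between mean (target) encoding and one-hot encoding of the product feature can be illustrated as follows. The sketch assumes a categorical column named `product` and a binary label column `label`, which are stand-ins for the actual field names. Mean encoding replaces each product with the average label observed in the training set, which is why it only helps when the product distribution in the test set matches that of the training set.

```python
import pandas as pd

def mean_encode(train: pd.DataFrame, test: pd.DataFrame,
                col: str = "product", target: str = "label"):
    """Mean (target) encoding: map each category to its mean label in the training set.
    Categories unseen in training fall back to the global mean."""
    means = train.groupby(col)[target].mean()
    global_mean = train[target].mean()
    return train[col].map(means), test[col].map(means).fillna(global_mean)

def one_hot_encode(train: pd.DataFrame, test: pd.DataFrame, col: str = "product"):
    """One-hot encoding aligned to the training categories, so both sets share columns."""
    train_oh = pd.get_dummies(train[col], prefix=col)
    test_oh = pd.get_dummies(test[col], prefix=col).reindex(columns=train_oh.columns,
                                                            fill_value=0)
    return train_oh, test_oh
```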