
Application Of Ensemble Learning Based On Improved Mixed Sampling Method In Pre-lending Default Prediction

Posted on: 2021-05-07 | Degree: Master | Type: Thesis
Country: China | Candidate: Y T Meng | Full Text: PDF
GTID: 2518306302974519 | Subject: Applied Statistics
Abstract/Summary:
With the rapid growth of China's economy, attitudes towards consumption have gradually shifted from saving first to spending in advance, which has driven rapid growth in China's personal credit market: consumer finance, Internet finance, and P2P platforms are flourishing. However, the immaturity of the relevant systems and China's particular national conditions pose severe challenges to risk control in the credit industry. Because of the nature of the industry, data sets for credit default prediction are usually highly imbalanced. In recent years, research on imbalanced classification has received widespread attention in machine learning and has made significant progress. Methods for classifying imbalanced data sets fall mainly into two groups: data-level methods that reconstruct the data set, and algorithm-level methods that adapt the classifier. This thesis addresses both directions. Building on previous research, it optimizes the related methods and applies them to credit default prediction on imbalanced data. The main contents are as follows.

(1) There are three main ways to reconstruct the data set: undersampling, oversampling, and mixed (hybrid) sampling. This thesis analyzes the characteristics and shortcomings of the classic SMOTE oversampling method and its various adaptive derivatives, and proposes an improved hybrid sampling method that combines isolation forest outlier detection, an improved SMOTE oversampling based on the positive-sample rate of the local region, and Tomek link removal. The goal is to address noise, small disjuncts within a class, intra-class imbalance, and class overlap. Experiments on KEEL data sets show that the hybrid sampling method further improves classification performance on the minority class compared with other sampling methods.

(2) Another effective way to handle imbalanced classification is to adapt the classifier algorithm to the imbalanced data set. This thesis combines the improved hybrid sampling method with the AdaBoost ensemble classification algorithm. On the one hand, this optimizes the training samples in each iteration of the ensemble method to improve classification accuracy on the minority class; on the other hand, it improves the predictive performance of the classification model. Experiments on KEEL data sets show that this method achieves higher AUC and G-mean values than the traditional AdaBoost method and the classic SMOTEBoost method.

(3) The classification algorithms used in this thesis range from traditional statistical models, such as Naive Bayes and logistic regression, to machine learning models such as CART decision trees, KNN, and random forests. For model comparison and evaluation, classification metrics such as F1 score, G-mean, and AUC are used to compare, from multiple angles, the effects of different imbalanced-data processing methods under different classification models.

(4) For the data provided by Lending Club: at the data set reconstruction level, several machine learning classification algorithms are trained on data processed with the improved hybrid sampling method. The results indicate that the proposed hybrid sampling predicts the defaulting population more accurately, which improves model performance. At the classifier optimization level, a CART decision tree is used as the base classifier for experimental analysis, which also improves prediction on the minority class. From the tests on KEEL data sets to the application to credit default prediction, the results suggest that the optimization methods proposed in this thesis generalize to some extent.
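The abstract reports F1 score and G-mean rather than plain accuracy. The following minimal sketch, which is not taken from the thesis, illustrates why: it uses the standard definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity) on a hypothetical toy sample of 95 non-defaults and 5 defaults. The function names and the toy data are assumptions made for illustration only.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the minority (default) class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, fp, fn, tn):
    # F1 = 2*TP / (2*TP + FP + FN); tn is accepted only for a uniform signature.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(tp, fp, fn, tn):
    # G-mean = sqrt(sensitivity * specificity)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: 95 non-defaults (0) followed by 5 defaults (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that never predicts default.
y_trivial = [0] * 100
# A hypothetical model: 5 false alarms, 4 of the 5 defaults caught.
y_model = [0] * 90 + [1] * 5 + [1] * 4 + [0] * 1

trivial = confusion_counts(y_true, y_trivial)
model = confusion_counts(y_true, y_model)

for name, counts in (("trivial", trivial), ("model", model)):
    print(name,
          "accuracy=%.3f" % accuracy(*counts),
          "F1=%.3f" % f1_score(*counts),
          "G-mean=%.3f" % g_mean(*counts))
```

On this toy sample the trivial classifier has the higher accuracy, yet its F1 score and G-mean are both zero because it never identifies a default, whereas the hypothetical model scores well on both.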
Keywords/Search Tags: credit default, classification, unbalanced data, hybrid sampling, ensemble learning, oversampling