Application Of Ensemble Learning Based On Improved Mixed Sampling Method In Pre-lending Default Prediction

Posted on:2021-05-07

Degree:Master

Type:Thesis

Country:China

Candidate:Y T Meng

Full Text:PDF

GTID:2518306302974519

Subject:Applied Statistics

Abstract/Summary:

With the rapid growth of China's economy, attitudes toward life and consumption have quietly shifted from saving before spending to spending ahead of income, which has driven rapid growth in China's personal credit market: consumer finance, Internet finance, and P2P platforms are flourishing. However, the immaturity of the relevant systems and China's particular national conditions pose severe challenges to risk control in the credit industry. Because of the nature of the industry, data sets for credit default prediction are usually highly imbalanced. In recent years, research on imbalanced classification has received widespread attention in machine learning and has made significant progress. Methods for classifying imbalanced data sets fall mainly into two groups: data set reconstruction and algorithm reconstruction. This paper addresses both research directions. Building on previous research, it optimizes the related methods and applies them to credit default prediction on imbalanced data. The main contents are as follows:

(1) There are three main approaches to reconstructing the data set: undersampling, oversampling, and mixed sampling. This paper analyzes the characteristics and shortcomings of the classic SMOTE oversampling method and its various adaptive derivatives. It then proposes an improved hybrid sampling method that combines isolation forest outlier detection, an improved SMOTE oversampling step based on the positive sample rate of the local region, and Tomek link data removal. The goal is to address noise, small disjuncts within classes, within-class imbalance, and class overlap. Experiments on KEEL data sets show that the hybrid sampling method further improves classification performance on minority classes compared with other sampling methods.

(2) Another effective way to address imbalanced classification is to optimize the classifier algorithm so that it adapts to the imbalanced data set. This paper combines the improved hybrid sampling method with the AdaBoost ensemble classification algorithm. On the one hand, this optimizes the training samples in each iteration of the ensemble method to improve classification accuracy on minority-class samples; on the other hand, it improves the predictive performance of the classification model. Experiments on KEEL data sets show that this method achieves higher AUC and G-mean values than the traditional AdaBoost method and the classic SMOTEBoost method.

(3) The classification algorithms considered in this paper range from traditional statistical models, such as classic Naive Bayes and logistic regression, to machine learning models such as CART decision trees, KNN, and random forests. In comparing and evaluating the models, metrics such as F1 score, G-mean, and AUC are used for a multi-angle comparative analysis of how different imbalanced data processing methods perform under different classification models.

(4) For the data provided by Lending Club: at the data set reconstruction level, multiple machine learning classification algorithms are trained on data processed with the improved hybrid sampling method. The results indicate that the improved hybrid sampling proposed in this paper predicts the defaulting population more accurately, which improves model performance. At the classifier optimization level, the CART decision tree model is used as the base classifier for experimental analysis, which also improves prediction on minority-class samples. From the tests on KEEL data sets to the application to credit default prediction, the results indicate that the optimization methods proposed in this paper generalize to some degree.
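To make the sampling steps in item (1) and the G-mean metric more concrete, the following is a minimal sketch in plain Python, not the thesis's implementation. It uses classic SMOTE rather than the improved variant based on the local positive sample rate, it omits the isolation forest outlier step entirely, and it assumes that both members of a Tomek link are removed. The function names (`nearest`, `smote`, `remove_tomek_links`, `g_mean`) and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def nearest(idx, points, k):
    """Indices of the k nearest neighbours of points[idx] (Euclidean distance)."""
    order = sorted(
        (j for j in range(len(points)) if j != idx),
        key=lambda j: math.dist(points[idx], points[j]),
    )
    return order[:k]


def smote(minority, n_new, k=3, seed=0):
    """Classic SMOTE: interpolate between a minority point and one of its
    k nearest minority neighbours (not the thesis's improved variant)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link: two points of opposite class
    that are each other's nearest neighbour (assumed removal policy)."""
    nn = [nearest(i, X, 1)[0] for i in range(len(X))]
    drop = {i for i in range(len(X)) if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for labels in {0, 1}."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(p == 1 for p in pos) / len(pos)
    specificity = sum(p == 0 for p in neg) / len(neg)
    return math.sqrt(sensitivity * specificity)


# Toy data (hypothetical): class 0 is the majority, class 1 the minority.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
            (0.4, 0.4), (0.2, 0.5), (0.5, 0.1), (1.0, 1.0)]
minority = [(1.1, 1.1), (1.3, 1.2), (1.2, 1.4), (1.4, 1.3)]

# Oversample the minority class until the classes are balanced,
# then clean overlapping pairs with Tomek link removal.
synthetic = smote(minority, n_new=len(majority) - len(minority))
X_res = majority + minority + synthetic
y_res = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
X_clean, y_clean = remove_tomek_links(X_res, y_res)
print(len(X_res), "->", len(X_clean), "samples after Tomek link removal")
```

In practice one would more likely reach for a library implementation of these steps; the brute-force neighbour search here is quadratic in the number of samples and is only meant to show the logic on a handful of points.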

Keywords/Search Tags:

credit default, classification, unbalanced data, hybrid sampling, ensemble learning, oversampling

Related items

1	Research On Oversampling Ensemble Learning Algorithm For Unbalanced Classification
2	Unbalanced Data Classification Based On Resampling And Hybrid Ensemble
3	Research On Outlier Detection For Unbalanced Data
4	Research On High-dimensional Unbalanced Data Classification Algorithm Based On Feature Selection And Ensemble Learning
5	Classification And Application Of Ensemble Learning In Unbalanced Data
6	Research On SVM Classification Of Unbalanced Data And Its Application In Identify Poor Students In Colleges And Universities
7	Research On Classification Algorithms For Unbalanced Data
8	Research On Credit Default Risk Control Under Imbalanced Data
9	Research And Application Of Integrated Algorithms For Unbalanced Data Sets
10	Classification Research For Unbalanced Data Based On Hybrid-sampling