Research On The Application Of Improved CatBoost Algorithm In Unbalanced Classification

Posted on: 2022-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: W Y Guo
GTID: 2517306530977229
Subject: Master of Applied Statistics
Abstract/Summary:
How to identify as many potential defaulters as possible without losing creditworthy users, reduce the default risk borne by lending institutions, and prevent economic bubbles is an urgent problem. Traditional machine learning algorithms usually model and learn under the assumption that the classes are relatively evenly distributed. In reality, however, credit data is imbalanced: the number of users who honor their contracts far exceeds the number who default. This class imbalance shifts the classifier's decision boundary and ultimately degrades classification performance. In addition, when classifying imbalanced data, the cost of misclassifying a majority-class sample is not equal to the cost of misclassifying a minority-class sample; that is, the problem is cost-sensitive. Some traditional algorithms ignore this, which makes it hard for them to identify the class of interest. Cost sensitivity can be divided into two categories: sensitivity to class cost and sensitivity to sample difficulty.

In response to these problems, this thesis attempts improvements in both data distribution and misclassification cost, and then uses machine learning methods to predict whether users will default. The experiments use the credit data released by Lending Club for the first and second quarters of 2018. At the data level, a Conditional Generative Adversarial Network (CGAN) for tabular data learns the feature information of the minority class and generates realistic minority-class samples; the number of synthesized samples equals the number of original positive-class samples. At the same time, the negative-class samples are randomly sampled at a ratio of 0.2, which makes the class distribution more balanced. At the algorithm level, CatBoost, a recent and efficient algorithm, is selected, and its original loss function is rewritten to address cost sensitivity.

The final model's AUC increased from 0.7631 to 0.7733.
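The abstract does not give the rewritten loss function, but the keywords list Focal Loss, so the general idea can be illustrated with the standard binary focal loss. The following is a minimal sketch under that assumption, not the thesis's implementation: the function names, the defaults `alpha=0.25` and `gamma=2.0`, and the clipping constant `EPS` are illustrative choices. It computes the loss for a single sample from a raw score (logit), along with the first derivative that a gradient-boosting objective would need.

```python
import math

EPS = 1e-12  # illustrative clipping constant to keep log() finite


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def _p_t_and_alpha_t(logit: float, label: int, alpha: float):
    """Return (p_t, alpha_t): probability and weight of the true class."""
    p = sigmoid(logit)
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # Clip so that log(p_t) stays finite for extreme logits.
    p_t = min(max(p_t, EPS), 1.0 - EPS)
    return p_t, alpha_t


def focal_loss(logit: float, label: int,
               alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t, alpha_t = _p_t_and_alpha_t(logit, label, alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_loss_grad(logit: float, label: int,
                    alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Derivative of focal_loss with respect to the logit."""
    p_t, alpha_t = _p_t_and_alpha_t(logit, label, alpha)
    sign = 1.0 if label == 1 else -1.0  # d(p_t)/d(logit) = sign * p_t * (1 - p_t)
    return alpha_t * sign * (1.0 - p_t) ** gamma * (
        gamma * p_t * math.log(p_t) + p_t - 1.0
    )


if __name__ == "__main__":
    # A confidently correct positive versus a badly misclassified positive.
    for score in (3.0, -3.0):
        print(score, focal_loss(score, 1), focal_loss_grad(score, 1))
```

With `gamma = 0` the modulating factor disappears and the loss reduces to class-weighted log loss; larger `gamma` shrinks the contribution of samples that are already classified well, so hard samples dominate training. Using this as a custom objective in a boosting library would also require the second derivative and that library's sign convention, which are outside this sketch.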
Keywords/Search Tags: Imbalanced Data, Credit Default, CTGAN, CatBoost, Focal Loss