
Research On Individual Credit Risk Assessment For Imbalanced Data

Posted on: 2021-03-31  Degree: Master  Type: Thesis
Country: China  Candidate: Q Zhou  Full Text: PDF
GTID: 2428330623976448  Subject: Engineering
Abstract/Summary:
Internet technology has given rise to many emerging industries and driven the rapid growth of Internet finance. From JD Credit Pay and Ant Credit Pay to P2P lending, more and more consumer credit products have entered people's lives. Before offering users convenient and reliable services, many Internet credit products need a personal credit risk assessment model, built on users' basic information and historical transaction data, to predict possible defaults. Building such a model with machine learning algorithms is a common approach to this practical problem.

Credit data are usually class-imbalanced. When a traditional classification algorithm is trained on imbalanced data, minority-class samples tend to be misclassified as the majority class, which leads to unsatisfactory predictions. In practice, however, correctly identifying the minority-class samples is what matters most, so classifying imbalanced data effectively has considerable research value. Credit data are also high-dimensional and contain many redundant features. The challenge is to select a feature subset that retains the most information with the fewest noise features, so that the model generalizes as well as possible and trains faster.

Against this background, this thesis proposes an improved data resampling method and a feature selection method for high-dimensional imbalanced credit data, with the aim of raising the recognition rate of minority-class samples, and builds an individual credit risk assessment model with gcForest. The specific research contents are as follows:

(1) Balancing the data by oversampling. An improved ADASYN oversampling method based on the HVDM distance measure is proposed to make the generation of new samples during oversampling more efficient and more reasonable.

(2) A feature selection algorithm based on the idea of minimum redundancy and maximum relevance is proposed. The AUC of each single feature is used as the measure of feature importance, and the Kendall correlation coefficient between features is calculated to select a feature subset that is highly informative and has few redundant features.

(3) Using imbalanced credit datasets from China and abroad, the deep forest algorithm gcForest is used to construct the individual credit risk assessment model. The cascade structure of the deep forest is improved by adding XGBoost to the base classifiers in each cascade layer, which further strengthens the forest's ability to identify minority-class samples. On this basis, the individual credit risk assessment model for imbalanced credit data is constructed.
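The abstract does not give the details of the feature selection algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: each feature is scored by its single-feature AUC, and a feature is skipped when its absolute Kendall correlation with an already selected feature is too high. The greedy selection order, the `max_tau` threshold of 0.8, the use of max(AUC, 1 - AUC) as a direction-free relevance score, the function names, and the toy data are all hypothetical and are not taken from the thesis.

```python
import math
from itertools import combinations


def single_feature_auc(values, labels):
    """AUC of one feature used directly as a score: P(pos > neg) + 0.5 * P(tie)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kendall_tau_b(x, y):
    """Kendall tau-b computed over all pairs (O(n^2), fine for a small sketch)."""
    conc = disc = only_x = only_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        if dx == 0:
            only_x += 1
        elif dy == 0:
            only_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = math.sqrt((conc + disc + only_x) * (conc + disc + only_y))
    return (conc - disc) / denom if denom else 0.0


def select_features(columns, labels, max_tau=0.8):
    """Greedy selection (assumed): most relevant first, skip redundant features."""
    relevance = {}
    for name, values in columns.items():
        auc = single_feature_auc(values, labels)
        relevance[name] = max(auc, 1.0 - auc)  # assumption: direction-free score
    ranked = sorted(columns, key=relevance.get, reverse=True)
    selected = []
    for name in ranked:
        redundant = any(
            abs(kendall_tau_b(columns[name], columns[kept])) > max_tau
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Hypothetical toy data: 6 non-default (0) and 2 default (1) samples.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
income = [50, 60, 55, 70, 65, 80, 20, 30]
columns = {
    "income": income,
    "income_x2": [2 * v for v in income],  # perfectly redundant with income
    "age": [25, 40, 31, 52, 38, 45, 29, 47],
}

selected = select_features(columns, labels)
print(selected)  # ['income', 'age']
```

In this toy run `income_x2` is dropped because its Kendall correlation with `income` is 1, while `age` is kept despite its weak AUC because it is not redundant. A practical version would probably also set a minimum relevance or a target subset size, which the abstract does not specify.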
Keywords/Search Tags: Personal credit assessment, Imbalanced classification, Feature selection, gcForest