| Class imbalance is very common in the real world. In bank lending, for example, the vast majority of transactions are normal and only a few are abnormal; in the bond market, the majority of bonds are normal and only a few default. These imbalances usually stem from the nature of the business itself, and in these areas it is often the small minority class that is of interest. To study class imbalance and let a model perform at its proper level, the original imbalanced data set usually has to be treated first. The class imbalance problem is therefore of great research importance. It has become a focus of attention in academia and industry, and many scholars have studied it, mainly using data-level methods, algorithm-level methods, and hybrid methods to handle class-imbalanced data, but some urgent problems remain to be solved. At the same time, the scale of China's bond market has grown dramatically on the back of economic development and defaults have occurred frequently, so the bond default problem, which is clearly class-imbalanced, urgently needs study. The main research work of this paper includes: (1) Bond default data contain many categorical features, and existing algorithms for processing categorical features are inefficient, may cause dimensional explosion, and depend on model parameter settings. To address these problems, the CatBoost model is applied to process the categorical features efficiently. The CatBoost model, which uses symmetric decision trees as base learners, has few parameters, and the algorithm's default parameters generally give good results, which reduces the need for hyperparameter tuning. The CatBoost algorithm also handles categorical
features without complex data preprocessing, making the model more efficient to use. The algorithm also supports custom loss functions, so a loss function suited to the problem under study can be selected for training. This paper uses bond default data as the model input and makes full use of its categorical features, which has application value and practical significance. (2) Existing methods for class-imbalanced data focus mainly on balancing the number of samples and ignore both the learning contributions of samples with different difficulty levels and the classification boundary of the classifier. To address these problems, the explicit gradient learning data augmentation (EGLA) method, based on a meta-learning framework, and a customized, improved GHMNA loss function are applied to the class imbalance problem. The classifier's decision boundary is adjusted by replicating minority-class samples and running a second iteration, and the GHM_C loss function is improved by obtaining the approximate gradient density through a nonlinear weighted average, which makes full use of historical gradient information and preserves its influence on the output. The empirical data show that, when the loss function is the traditional logarithmic loss in every case, the F1 score of EGLA improves by 3.3%, 2.1%, and 1.6% over ADASYN, SMOTE, and B-SMOTE, respectively. Both the empirical results and the generalization experiments show that the algorithm remedies the shortcomings of traditional class-imbalance processing algorithms by assigning different weights to the learning contributions of samples with different difficulty levels, and that it is competitive with algorithms in the literature. (3) To address complex gradient computation, large model loss, slow training, and possible model
degradation, the custom GHM_C loss function used in this paper is nonlinearly accelerated. An approximate gradient density that is arbitrarily close to the optimal one is extrapolated by summarizing all useful information from a global perspective and synthesizing the output of the entire iterative process of the algorithm. The empirical results show that, when the imbalance treatment method is the ADASYN algorithm, the F1 score of the improved GHMNALoss loss function is higher than that of the CE, Log Loss, and GHMLoss loss functions by 4.1%, 4.4%, and 2.0%, respectively. The results indicate that the improved loss function improves the efficiency and accuracy of classification, and further comparison and verification through generalization experiments show that the algorithm is feasible. |
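
To make the role of gradient density concrete, the following is a minimal sketch of the standard GHM_C weighting for binary classification, using the usual unit-region (histogram) approximation. It is not the improved GHMNA loss described above: the nonlinear weighted average over historical gradient information is omitted, and the function name `ghm_c_loss`, its parameters, and the default of 10 regions are illustrative assumptions rather than details taken from this paper.

```python
import math


def ghm_c_loss(probs, labels, bins=10):
    """Standard GHM_C loss (unit-region approximation); hypothetical helper."""
    n = len(probs)
    # Gradient norm of the cross-entropy loss w.r.t. the logit: g = |p - y|.
    grad_norms = [abs(p - y) for p, y in zip(probs, labels)]
    # Assign each sample to one of `bins` equal-width regions of [0, 1].
    region = [min(int(g * bins), bins - 1) for g in grad_norms]
    counts = [0] * bins
    for r in region:
        counts[r] += 1
    # Approximate gradient density: samples in the region / region width (1/bins).
    density = [counts[r] * bins for r in region]
    # Harmonizing weight beta_i = N / GD(g_i): crowded regions are down-weighted.
    weights = [n / d for d in density]
    eps = 1e-12  # guards against log(0)
    ce = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(probs, labels)
    ]
    loss = sum(w * c for w, c in zip(weights, ce)) / n
    return loss, weights


# Four easy majority samples and one harder minority sample.
loss, weights = ghm_c_loss([0.05, 0.04, 0.03, 0.02, 0.55], [0, 0, 0, 0, 1])
```

In this toy input, the four easy majority samples fall into the same region and each receive a weight of 0.125, while the single harder minority sample receives 0.5, so the loss is no longer dominated by the many easy samples.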