The Application Of Rule Extraction Based On Ensemble Tree Model In Personal Credit Risk Control

Posted on: 2021-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: X M Lu
GTID: 2510306302474514
Subject: Applied Statistics
Abstract/Summary:
With the development of the Internet and big data, and with rising levels of household consumption, personal credit business has continued to expand and upgrade. Risk control technology is at the core of the sustainable development and profitability of this business. Lending platforms mostly use big data technology to control risk through a combination of strategies and models. Most strategies and rules are developed with tools such as decision trees. This approach is inefficient, lengthens the strategy formulation cycle, and leaves the effectiveness of the resulting rules dependent on the experience and ability of individual risk analysts. For this reason, this thesis explores how to improve the extraction of risk control rules using tree-based ensemble models. The aim is to use statistical methods to make rule extraction more efficient and automated, and so to improve current practice in credit risk control.

The author first introduces the theory of tree-based ensemble models and of machine learning model interpretation, together with inTrees and defragTrees, two algorithms that transform ensemble models into rule sets. The thesis uses the German credit dataset from the UCI repository. The dataset is first pre-processed, mainly by transforming categorical variables into numerical ones with the weight of evidence (WOE) method. Information value is also calculated for potential feature selection. The author then trains random forest and other tree-based ensemble models on the training data and extracts rule sets with inTrees and defragTrees. A set of evaluation indicators is developed covering prediction accuracy and rule attributes. These indicators are used to compare the rule sets extracted from random forest, XGBoost and GBM with inTrees and with defragTrees, with the help of the Friedman test and the post-hoc Nemenyi test. Baseline models such as decision trees, logistic regression and an association-rule-based classification model are also included in the evaluation.

Among the baselines, traditional models such as logistic regression have high predictive performance but cannot directly generate rules on demand. A single decision tree has a natural rule structure, but its rules may lack diversity and be confined to a limited set of interpretation dimensions. The absence of overlap between rules may reduce that diversity further. Decision trees can be pruned through parameter tuning, but the effect on rule-related attributes is difficult to control. Moreover, when rules must be extracted to meet specific business requirements, there is hardly any flexible and convenient way to do so other than repeated manual attempts. Rule sets extracted with association-rule methods are susceptible to data imbalance: the generated rules mostly point to the dominant class, are redundant, and each covers only a small number of samples.

Rule sets extracted with the inTrees framework are of high quality, mainly because of its pruning and selection procedures, which partly resolve problems such as overfitting and redundancy. The number of rules returned is comparable to that produced by a decision tree with a maximum depth of 4. With little difference in accuracy, the rules produced by inTrees are shorter and each rule covers more samples, so the rule sets balance prediction performance and interpretability. Compared with the original ensemble models, whether in overall accuracy or in the misclassification of a single class, the rule sets produced by inTrees are not inferior in accuracy and perform stably.

The defragTrees framework rests on a different theory from inTrees, which makes its rule-extraction performance less stable than that of inTrees. Given a number K, it converts the original ensemble model into a rule set containing K rules. On the credit dataset, rule sets generated by defragTrees have an advantage in recall for non-default users (0-CR), and the same holds at the level of individual rules, but they are weaker in overall prediction accuracy. Because overall classification accuracy (AR) is the criterion for selecting the best model in each iteration, there is some trade-off against the other accuracy indicators. The criterion can easily be changed, however, so analysts with particular business needs can build their trade-off preferences into it. Overall, this makes defragTrees more flexible and convenient for business analysis.

The inTrees and defragTrees frameworks each have their own characteristics when extracting rules from ensemble models. The inTrees framework performs better on recall for the default class, and its rules have high coverage. Its rules also perform stably and do not easily fail on the test set. It is therefore more suitable for risk analysts extracting risk control rules in their daily work. The defragTrees framework, by contrast, is better suited to heuristic rule exploration. The thesis also presents examples of rules extracted with inTrees and defragTrees, whose business logic is easy to interpret. In summary, the thesis provides a complete, highly automated solution for extracting rules from ensemble models for risk control in personal credit business using inTrees and defragTrees. The extracted rules are of high quality and have good predictive performance.
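
The pre-processing step described in the abstract, encoding categorical variables with WOE and computing information value, can be illustrated with a minimal sketch. This is not the author's implementation. The function name `woe_iv`, the additive `smoothing` term, the sign convention ln(share of non-defaults / share of defaults), and the toy data are all assumptions made for illustration; the abstract does not state which convention or software the thesis uses.

```python
import math
from collections import Counter


def woe_iv(categories, labels, smoothing=0.5):
    """Return ({level: WOE}, IV) for one categorical feature.

    labels: 1 = default, 0 = non-default (assumed coding).
    smoothing: assumed additive adjustment that avoids log(0)
    when a level contains only one class.
    """
    non_default = Counter(c for c, y in zip(categories, labels) if y == 0)
    default = Counter(c for c, y in zip(categories, labels) if y == 1)
    levels = sorted(set(categories))

    # Smoothed class totals, so the per-level shares still sum to 1.
    total_non_default = sum(non_default.values()) + smoothing * len(levels)
    total_default = sum(default.values()) + smoothing * len(levels)

    woe, iv = {}, 0.0
    for level in levels:
        share_non_default = (non_default[level] + smoothing) / total_non_default
        share_default = (default[level] + smoothing) / total_default
        # Assumed convention: WOE = ln(non-default share / default share).
        woe[level] = math.log(share_non_default / share_default)
        # IV sums (share difference) * WOE over all levels.
        iv += (share_non_default - share_default) * woe[level]
    return woe, iv


# Toy data (hypothetical, not taken from the German credit dataset).
housing = ["own", "own", "own", "rent", "rent", "rent"]
defaulted = [0, 0, 1, 1, 1, 0]
woe_by_level, information_value = woe_iv(housing, defaulted)
encoded = [woe_by_level[level] for level in housing]  # numeric feature
```

Under this convention a positive WOE marks a level with relatively more non-default users, and a larger information value suggests a feature that separates the two classes more strongly, which is why it can serve as a screening statistic for feature selection.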
Keywords/Search Tags: Credit Risk Control, Rule Extraction, Ensemble Model