| From 2020 to 2022, the China Banking and Insurance Regulatory Commission successively issued the Interim Measures for the Management of Internet Loans of Commercial Banks and the Notice on Further Regulating the Internet Loan Business of Commercial Banks. These documents show that credit risk management and risk model building are the focus of the Internet lending industry under the new circumstances. They emphasize that commercial banks should strengthen their independent risk control capabilities and further improve loan management with risk control at its core. Financial technology makes credit risk identification more accurate and efficient. The core value of technology-driven credit data mining is to learn risk-related features from historical data and use them to identify and predict possible credit risks, thereby providing effective decision support for the relevant institutions. Credit risk management based on data mining currently faces two kinds of problems. First, customer segmentation takes a single perspective and existing methods lack applicability, so loans are poorly targeted. Second, credit data are severely imbalanced, high-dimensional, and contain mixed information, which leads to practical problems such as inadequate description of customer characteristics and inaccurate risk assessment. Credit data mining therefore poses particular challenges, and more advanced mining methods are needed to adapt to the characteristics of credit data and improve credit risk identification. In response to these practical problems, this thesis studies the following scientific problems: (1) Credit customer segmentation from the perspective of default risk and misclassification degree. Based on a heterogeneous ensemble learning method, credit customers are segmented from the perspective of default risk in order to study the characteristics of different
customer subclasses. (2) Default risk assessment that aims to address data imbalance and reduce the actual losses caused by misclassification. Combining misclassification loss, misclassification degree, and other factors, a sample focusing matrix is constructed for the segmented customer classes, and a Focusing Matrix combining Boosting-RF algorithm (FOMBRF) is proposed that incorporates this matrix. (3) Classification analysis space recognition. To address the high dimensionality and mixed information of credit data, this thesis proposes, based on business understanding, a credit data attribute division method that considers the business process, and combines it with customer segmentation to form an object-attribute space division method for credit data. Based on the divided data, different customers are classified and customer characteristics are mined from different dimensions. The main innovative results of this thesis are as follows: (1) A credit customer segmentation model and risk feature learning method are proposed from the perspective of default risk and classification ability. Traditional customer segmentation takes customer value and customer loss as its perspective and precision marketing as its research target, whereas the objective of credit customer segmentation is risk management. Based on the idea of heterogeneous ensemble learning, the prediction results of multiple base classifiers on credit data show that customers with the same class label still differ in their degree of default risk. According to the prediction results of the base classifiers, each customer is given a risk rating, and credit customers are divided into eight categories from the perspective of default risk, including the customers with the highest default risk, the target customers, the non-default customers that are most easily misclassified, and the customers with the highest potential risk. The important risk characteristics that distinguish the customer segments are then mined, which
provides a new perspective and method for the study of credit customer segmentation and customer characteristics. In addition, in machine learning tasks on severely imbalanced credit data, accurately identifying the small number of risk samples that are easily misclassified is important for further model optimization and for reducing institutional losses. (2) Credit default risk prediction based on the Focusing Matrix combining Boosting-RF (FOMBRF). Traditional methods for imbalanced classification usually improve training results by changing the sample distribution or by synthesizing new minority samples according to various criteria, but this causes varying degrees of information loss or distortion. Combining the misclassification degree of the samples in each customer subclass with the actual misclassification loss of each subclass, the Focusing Matrix (FOMA) is proposed to make the model learn more from high-risk, high-loss, and difficult-to-classify samples. This thesis systematically analyzes the relationship between the variance and bias of classification models. Combining the advantages of the Bagging and Boosting strategies, a Boosting-based Random Forest algorithm (Boosting-RF) is proposed to reduce both the variance and the bias of the model, and the algorithm is verified through comparative experiments. The sample focusing matrix and the Boosting-RF algorithm are then combined into a random forest Boosting algorithm with a focusing matrix to predict default risk. Comparing the models' evaluation metrics and the reduction in actual loss they bring verifies the effectiveness and practical value of the proposed algorithm for credit risk classification. (3) Research on credit customer characteristics based on object-attribute space division of credit data. Traditional feature dimension reduction methods cause varying degrees of information
loss and filter out valuable borrower-related features. To address the practical difficulty of describing customer risk characteristics caused by the high dimensionality and mixed information of credit data, this study interprets the data based on business understanding and, taking into account the information process and information categories, proposes a credit data attribute classification method that considers the business process. All attributes are divided into six types belonging to pre-loan information and post-loan information. The pre-loan information includes the basic information attributes of the borrower, the information attributes of the loan application, and the information attributes of the loan object. Based on the data after attribute division, the borrower characteristics are described from multiple angles, starting from different information dimensions and taking different variables as classification targets, and the knowledge contained in the pre-loan information is fully mined to guide credit granting and approval decisions. Combined with the credit customer segmentation method proposed in this thesis, the object-attribute space division problem of credit data is solved. Using the proposed method, this thesis characterizes four types of customers: the customers with the highest default risk, the target customers, the non-default customers that are most easily misclassified, and the customers with the highest potential risk. Their characteristics are described by specific rules from the perspective of the borrower's basic information, the loan application information, and the loan object information. This provides high-value and efficient decision support for credit approval and targeted post-loan management. |
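
The segmentation idea summarized above can be illustrated with a minimal sketch. The abstract does not state how the eight categories are derived, so the scheme here is an assumption made only for illustration: three base classifiers each cast a binary default vote, and a customer's segment is the pair (true label, number of default votes), which gives 2 x 4 = 8 segments. The function name `segment_customers`, the toy data, and the mapping of four segments to the customer types named in the abstract are all hypothetical.

```python
# Hypothetical mapping of four of the eight segments to the customer types
# named in the abstract; the thesis may define them differently.
NAMED_SEGMENTS = {
    (1, 3): "highest default risk",                    # defaulter, all classifiers agree
    (0, 0): "target customer",                         # non-defaulter, no default votes
    (0, 3): "most easily misclassified non-default",   # non-defaulter, all vote default
    (1, 0): "highest potential risk",                  # defaulter, no classifier flags it
}


def segment_customers(labels, predictions):
    """Assign each customer a (true_label, default_votes) segment.

    labels: list of 0/1 true labels, where 1 means default.
    predictions: one list of 0/1 predictions per base classifier.
    """
    if any(len(p) != len(labels) for p in predictions):
        raise ValueError("every classifier must predict every customer")
    segments = []
    for i, label in enumerate(labels):
        votes = sum(p[i] for p in predictions)  # number of default votes
        segments.append((label, votes))
    return segments


# Toy data: six customers, three assumed base classifiers.
labels = [1, 0, 0, 1, 1, 0]
predictions = [
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
]

segments = segment_customers(labels, predictions)
for customer_id, segment in enumerate(segments):
    name = NAMED_SEGMENTS.get(segment, "intermediate segment")
    print(customer_id, segment, name)
```

In this sketch, customers with the same true label land in different segments depending on how many base classifiers flag them, which mirrors the observation that customers with the same class label differ in their degree of default risk.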