| Against the backdrop of increasingly serious global environmental problems, countries around the world have gradually reached a consensus on carbon neutrality and environmental protection, and the core concept of sustainable development has become increasingly popular. As an emerging financial instrument, green bonds actively support energy conservation and environmental protection projects, providing financial service channels and feasible options for financing and for small and medium-sized enterprise investors. In recent years, however, default events have occurred frequently in the bond market. How to reduce credit default risk as far as possible and control the credit spread of bonds issued by enterprises is a central concern of all relevant stakeholders. This thesis uses ensemble learning algorithms to build a green bond credit spread prediction model and combines it with a scoring model to monitor and manage credit spread risk. First, the indicator system is constructed and the raw data are processed. The default distance, calculated with KMV and PFM models modified by a regression method, is taken as one of the important indicators of the credit spread; it is combined with selected external environmental variables and internal factors, including ESG rating indicators, corporate financial indicators, and bond issuance information, to form the initial indicator system. The data obtained from different platforms are then integrated according to the actual situation, missing values are filled in, and the data are cleaned and converted to a consistent format. Second, the multidimensional indicators are filtered. Filter methods such as Pearson correlation analysis, ANOVA, and mutual information are used to test the correlation between numerical variables, categorical variables, and credit spreads, as well as the correlation among the categorical variables. Features that have little or no significant impact on the target variable are eliminated, leaving 21 indicators for model training. Then, the
processed green bond data are used to fit regression models that predict the credit spread. Single models such as Lasso regression and decision trees, the typical Bagging algorithm random forest, Boosting algorithms such as XGBoost, AdaBoost, and LightGBM, and Stacking ensemble learning based on heterogeneous learners are each constructed. Combined with Embedded feature selection, the models are fitted and their performance is compared. Among them, the Stacking ensemble (with random forest, XGBoost, and GBDT as primary learners and Lasso regression as the secondary learner) combines the advantages of statistical models and ensemble learning and performs best: its R-squared reaches 0.9697, its error remains low, and it passes the robustness test. The Stacking model is therefore selected for regression prediction, and the important factors influencing the credit spread are interpreted in terms of their economic significance. Finally, a risk classification scoring model for green bond credit spreads is constructed. Based on criteria such as credit default events and default distance, pseudo-labeling, a semi-supervised learning method, is used to obtain credit labels for all green bonds, and SMOTE oversampling is used to augment the data and balance the dataset. Combined with Wrapper feature selection, the SVM-RFM, XGBoost, and Null Importance-AdaBoost algorithms are introduced for classification. After a comprehensive comparison, the XGBoost model performs best, with an accuracy of 0.8689 and an F1 score of 0.8758. Based on the XGBoost classification and logistic regression, the risk scoring model is built through variable binning and WOE (weight of evidence) conversion. Its AUC is 0.71 and its KS value is 0.3143, which indicates that the model can distinguish green bonds with low default risk from those with high default risk, making it possible to set up a risk warning mechanism and further predict the trend of the credit spread. |
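
The WOE conversion and the KS value mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `woe_by_bin` and `ks_statistic`, the `smoothing` parameter, the sign convention (WOE as the log of the good share over the bad share), and the toy data are all assumptions made here for illustration.

```python
import math


def woe_by_bin(bins, labels, smoothing=0.5):
    """Return {bin: WOE}, with WOE = ln(share of goods / share of bads).

    labels: 1 = bad (high default risk), 0 = good (low default risk).
    `smoothing` is an illustrative additive correction that keeps the
    logarithm finite when a bin contains no goods or no bads.
    """
    counts = {}
    for b, y in zip(bins, labels):
        good, bad = counts.get(b, (0, 0))
        counts[b] = (good + (y == 0), bad + (y == 1))
    k = len(counts)
    total_good = sum(g for g, _ in counts.values()) + smoothing * k
    total_bad = sum(bd for _, bd in counts.values()) + smoothing * k
    return {
        b: math.log(((g + smoothing) / total_good) / ((bd + smoothing) / total_bad))
        for b, (g, bd) in counts.items()
    }


def ks_statistic(scores, labels):
    """Largest gap between the cumulative bad and good shares.

    Observations are sorted by score (highest risk first); tied scores
    are accumulated together before the gap is measured.
    """
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            cum_bad += pairs[j][1]
            cum_good += 1 - pairs[j][1]
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks


if __name__ == "__main__":
    # Toy data only; these values do not come from the thesis.
    toy_bins = ["low", "low", "low", "high", "high", "high"]
    toy_labels = [0, 0, 1, 1, 1, 0]
    toy_scores = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4]
    print(woe_by_bin(toy_bins, toy_labels))
    print(ks_statistic(toy_scores, toy_labels))
```

In a scorecard workflow of this kind, each binned variable is typically replaced by its WOE value before the logistic regression is fitted, and the KS value is then computed on the resulting scores to gauge how well low-risk and high-risk bonds are separated.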