| The real estate industry occupies an important position in China’s economic system. At the end of 2021, the real estate industry accounted for 6.78% of national GDP. At present, some down payment loan products have appeared in China’s market, and even buyers who do not have the ability to repay can apply for loans from banks, so defaults often occur. Therefore, the risk management of personal housing loans has become an important concern for many banks. This article analyses housing loan data. First, data preprocessing, feature engineering and data exploration are carried out on the data set. Second, variables are selected and default risk prediction models are established. Finally, the optimal model is used to build a scorecard. The specific contents are as follows: (1) Data processing. First, data cleaning is performed, mainly for missing values and outliers in the data. Variables with a missing rate greater than 30% are deleted; for variables with a missing rate less than 30%, numerical variables are filled with the median and categorical variables are filled with the mode. The 3σ criterion is used to determine whether a value is an outlier; if outliers account for less than 1% of the data, they are handled in the same way as missing values, and otherwise they are left unprocessed. Next, feature engineering is carried out, mainly variable derivation, data set division, feature binning, and WOE and IV calculation, and the data imbalance problem is handled using a combination of oversampling and undersampling methods. (2) Data exploration. This section uses descriptive statistics and histograms for data exploration and analysis. In terms of basic customer information, customers in the (20, 25] age range have a higher default rate, and the default rate gradually decreases with age. Female customers have a higher risk of default, and customers with low education have a higher default rate. In terms of clients’
family circumstances, unmarried clients have higher default rates and widowed clients have the lowest default rates. In terms of customers’ work, customers with short working years and low income have a higher default rate. (3) Feature selection. The random forest model, logistic regression and the XGBoost model are used, combined with the TOPSIS comprehensive evaluation method, to obtain a comprehensive score for each variable, and the variables are sorted by this score to produce a feature ranking table. The number of features is then determined in light of the performance of the subsequent models, and finally 20 variables, such as external standard score_3 and age, are selected. (4) Establishment of the default risk prediction model and credit scorecard. A variety of single default risk prediction models are constructed on the housing loan data and evaluated on the test set. Each single model is used as a primary classifier of the fusion model, with logistic regression as the secondary classifier. All the established models are then compared, and the fusion model is found to perform best: the recall rates for the non-default and default samples are 0.9697 and 0.9269, respectively, the AUC of the model reaches 0.98 and the KS value is 0.3933. Finally, a credit scorecard for housing loan customers is constructed based on the fusion model. According to the scorecard results, all customers with a score below 370 points are defaulting customers. In the score range from 420 to 480, the default rate drops sharply. Groups with higher scores show a smaller default ratio, and none of the customers with a score of 702 or more default. |
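
The cleaning rules stated in part (1) can be sketched for a single variable as follows. This is only an illustrative sketch and not the implementation used in the paper: the function names are hypothetical, missing values are assumed to be encoded as `None`, the missing rate is assumed to be checked before outliers are masked, a variable with a missing rate of exactly 30% is assumed to be kept, and the 3σ band is assumed to use the population standard deviation of the observed values.

```python
from statistics import mean, median, mode, pstdev

MISSING_DROP_THRESHOLD = 0.30   # from the text: delete variables with missing rate above 30%
OUTLIER_SHARE_THRESHOLD = 0.01  # from the text: outliers under 1% are treated like missing values


def missing_rate(values):
    """Share of entries that are missing (encoded as None, an assumption)."""
    return sum(v is None for v in values) / len(values)


def mask_outliers_3sigma(values):
    """Replace 3-sigma outliers with None, but only if they are under 1% of the column."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        return list(values)  # constant column: nothing can be an outlier
    flags = [v is not None and abs(v - mu) > 3 * sigma for v in values]
    share = sum(flags) / len(values)
    if 0 < share < OUTLIER_SHARE_THRESHOLD:
        return [None if flag else v for v, flag in zip(values, flags)]
    return list(values)  # 1% or more outliers: left unprocessed, as in the text


def fill_missing(values, numeric):
    """Fill None with the median (numerical) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


def clean_column(values, numeric):
    """Return the cleaned column, or None if the variable should be deleted."""
    if missing_rate(values) > MISSING_DROP_THRESHOLD:
        return None
    if numeric:
        values = mask_outliers_3sigma(values)
    return fill_missing(values, numeric)


# Hypothetical numerical column: 200 values, one extreme value and one missing value.
income = [49.0, 51.0] * 100
income[0] = 1000.0
income[1] = None
cleaned_income = clean_column(income, numeric=True)
```

In this sketch the extreme value is first turned into a missing value and then filled with the median together with the original gap, which mirrors the statement that rare outliers are "processed according to the method of missing values". Categorical variables skip the 3σ step because the criterion only applies to numerical data.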