Research on Enterprise Dishonesty Based on Ensemble Learning

Posted on: 2021-09-14  Degree: Master  Type: Thesis
Country: China  Candidate: R Z Zu  Full Text: PDF
GTID: 2506306248455854  Subject: Applied Statistics
Abstract/Summary:
Since China's reform and opening up, numerous enterprises have sprung up in various fields, contributing greatly to China's economic development while growing and earning substantial profits themselves. Driven by the pursuit of profit, however, some unscrupulous enterprises engage in illegal and dishonest behavior that damages the interests of consumers, society and even the country. Enterprise credit identification is therefore an important part of the development of Chinese enterprises. The problem has traditionally been treated as an economics problem, grounded in economic concepts. With the development of science and technology, introducing a new system to judge whether an enterprise is dishonest will help create a more favorable environment for enterprise development.

The data set used in this paper comes from the innovation and entrepreneurship competition of Shandong Province, which provides various indicators for many enterprises in the province. The specific website is: http://sdac.qingdao.gov.cn/common/cmptindex.html. The ultimate goal is to build a model based on these indicators that predicts whether an enterprise is dishonest. The paper first preprocesses the data set, mainly through missing value analysis, outlier analysis using box plots, and class imbalance handling using the SMOTE algorithm. Feature engineering is then applied to the original data: new features are constructed and useful features are selected.

Model building proceeds in four stages. The first stage trains four common base learners on the processed data: a logistic regression model, a k-nearest-neighbor model, a decision tree model and a naive Bayes model. The four base models are then combined by stacking, and the stacked model achieves better results than any of the four base learners. The second stage fits the data set with four models common in ensemble learning: XGBoost, LightGBM, random forest and GBDT. All four ensemble models perform well. Stacking is then applied on top of these four models, but the stacked model performs worse overall, so other approaches are needed to improve performance. The third stage constructs a homogeneous ensemble based on random perturbation. Each of the four ensemble models trained in the second stage is used to rank the importance of its own input features. Each of the four model types is then used to train 20 models, where the number of input features of each model is random and the parameters of each model are also drawn at random within a certain range, which ensures a degree of diversity among the models. Finally, the 20 models are combined by simple averaging to obtain a new model. Compared with the single ensemble models of the second stage, overall performance improves somewhat. The fourth stage is model fusion, whose main purpose is to improve performance further. The previously trained models are first scored using the maximal information coefficient (MIC), and models that differ substantially from one another are selected. The selected models are then fused by simple weighted averaging, several fused models are compared, and the best-performing one is taken as the final model of this paper.

Finally, from the perspective of practical application, the trained model is applied to the existing data in a bad debt rate scenario constructed by the author, and this hypothetical application scenario illustrates the practical value of the model.
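As a rough illustration of the third and fourth stages described in the abstract, the sketch below draws 20 random configurations (a random number of input features plus random parameters within a range) and fuses model outputs by simple or weighted averaging. It is not the thesis code: the function names, feature names, parameter ranges and probabilities are all hypothetical, and taking the top-k features from the importance ranking is an assumption about how the random number of inputs is applied. No model is trained here; the predictions are made-up numbers used only to show the averaging.

```python
import random


def sample_configs(ranked_features, param_ranges, n_models=20,
                   min_features=2, seed=0):
    """Draw one (feature subset, hyperparameters) pair per ensemble member."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n_models):
        # Random number of inputs; assumed here to be the top-k by importance.
        k = rng.randint(min_features, len(ranked_features))
        # Each hyperparameter is drawn uniformly from its (low, high) range.
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in param_ranges.items()}
        configs.append({"features": ranked_features[:k], "params": params})
    return configs


def weighted_average(predictions, weights=None):
    """Fuse per-model probability lists; equal weights give the simple average."""
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Hypothetical inputs, for illustration only.
ranked = ["f1", "f2", "f3", "f4", "f5"]  # features sorted by importance
ranges = {"learning_rate": (0.01, 0.3), "subsample": (0.6, 1.0)}
configs = sample_configs(ranked, ranges)

# Made-up predicted probabilities of dishonesty from three models on two firms.
preds = [[0.2, 0.9], [0.4, 0.7], [0.6, 0.8]]
simple = weighted_average(preds)                          # stage-3 style averaging
fused = weighted_average(preds, weights=[0.5, 0.3, 0.2])  # stage-4 style fusion
```

In practice each configuration would be used to train one XGBoost, LightGBM, random forest or GBDT model, and the fusion weights would be chosen by comparing candidate fused models on validation data.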
Keywords/Search Tags: XGBoost, LightGBM, GBDT, Random Forest, Ensemble learning, Model fusion