In recent years, a growing number of companies seeking improper gains have distorted objective facts and fabricated or tampered with financial report data on the basis of untrue and incomplete accounting information. This not only violates the national unified accounting system but also disrupts the order of the market economy, and misled investors make wrong decisions and ultimately suffer investment losses. A financial data fraud prediction model can therefore help investors accurately identify companies that commit financial data fraud, simplify investment decision-making, protect investment interests, reduce investment risk, maintain market economic order, and improve risk management. The study proceeds as follows. First, using financial data obtained from real listed manufacturing companies, the data are analyzed descriptively, and outliers and missing values are handled. Four resampling methods (SMOTE, ADASYN, NearMiss, and SMOTETomek) are then compared by their effect on model performance, and SMOTE performs best. Second, variance filtering and the mutual information method, both filter methods, serve as the first round of feature selection, removing features with zero variance and features that are independent of the label. The second round uses embedded selection based on a penalty term (the logistic regression algorithm) and on tree models (the Random Forest, AdaBoost, XGBoost, GBDT, and LightGBM algorithms). The features selected by the six models are put to a vote, and those selected by at least two algorithms are retained as the final indicators of the model. Finally, although single prediction models are well studied, their evaluation metrics are not high and their generalization ability is limited. To improve predictive ability, the single models are first built and compared, and then combined: this paper selects four models (GBDT, LightGBM, AdaBoost, and XGBoost) as the primary learners of a Stacking fusion model, with a simple logistic regression (LR) model as the secondary learner. On the test set, the Stacking fusion model constructed in this paper achieves an accuracy of 0.8093, a specificity of 0.8109, a recall of 0.6207, and an AUC of 0.7619. Compared with GBDT, the best-performing single model, it improves accuracy by 0.18% and specificity by 0.18% while keeping the fraud recall unchanged. In summary, the Stacking fusion model constructed for financial data fraud among listed manufacturing companies has some reference value and practical significance.
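The voting rule used in the second round of feature selection (keep a feature if at least two of the six models select it) can be illustrated with a minimal sketch. The function name `vote_features`, the selector keys, and every feature name below are hypothetical and chosen only for illustration; they are not the indicators or code used in this paper.

```python
from collections import Counter


def vote_features(selections, min_votes=2):
    """Return the features chosen by at least `min_votes` selectors.

    `selections` maps a selector name to the features it selected.
    The result is ordered by vote count (descending), then by name.
    """
    votes = Counter()
    for chosen in selections.values():
        # set() ensures each selector contributes at most one vote per feature
        votes.update(set(chosen))
    kept = [feature for feature, n in votes.items() if n >= min_votes]
    return sorted(kept, key=lambda feature: (-votes[feature], feature))


# Hypothetical output of the six second-round selectors (invented feature names)
selections = {
    "logistic_l1": ["roa", "debt_ratio", "current_ratio"],
    "random_forest": ["roa", "inventory_turnover"],
    "adaboost": ["debt_ratio", "cash_flow_ratio"],
    "xgboost": ["roa", "debt_ratio"],
    "gbdt": ["inventory_turnover", "roa"],
    "lightgbm": ["gross_margin"],
}

final_features = vote_features(selections, min_votes=2)
print(final_features)  # ['roa', 'debt_ratio', 'inventory_turnover']
```

Raising `min_votes` makes the rule stricter; with the toy data above, `min_votes=4` would keep only `roa`.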