| Objective: This study aims to optimize the imputation strategy by determining whether a binary outcome variable should be included in the imputation model. On this basis, it compares the performance of several of the most commonly used imputation methods at different missing rates using simulated missing datasets, and recommends an imputation strategy and corresponding imputation methods suitable for handling missing data in cross-sectional research. The aim is to provide data and theoretical support for handling missing data in cross-sectional research appropriately, and to offer new ideas for research on missing data imputation methods. Methods: First, the Indian Liver Patient Dataset (ILPD) from the UCI public database was used to simulate 50 datasets with randomly deleted values at different missing rates, with the missing rate increasing gradually from 5% to 30%. Second, at each missing rate, the simulated missing datasets were imputed using different imputation strategies (with or without the outcome variable) and imputation methods (multiple imputation, random forest imputation, and K-nearest neighbor (KNN) imputation). Then, under each imputation strategy and method, imputation accuracy was compared across missing rates, and logistic regression was performed on the imputed datasets to examine how imputation affects the analysis of influencing factors. Finally, liver fibrosis data from a liver disease clinic were used to determine the advantages and disadvantages of the imputation methods at different missing rates and to verify the role of the outcome variable in imputation. Results: As the missing rate increases, imputation accuracy gradually decreases. At a 5% missing rate, each method has the smallest MMAPE and the highest MA. At a 30% missing rate, each method has the largest MMAPE and the lowest MA. In the ILPD dataset, the
situation varies slightly with the missing rate. At a 15% missing rate, the MMAPE is smaller without the outcome variable, while above a 15% missing rate, the MMAPE is smaller with the outcome variable. For discrete variables, the only exceptions were multiple imputation at 15% and 20% missing rates, where the MA was slightly higher without the outcome variable (values of 0.9537 and 0.9547, and 0.9538 and 0.9399, respectively, with reported differences of only 0.001 and 0.002); under all other conditions, the MA was larger with the outcome variable. Multiple imputation showed a slightly higher MADC with the outcome variable at a 5% missing rate (0.2808 with the outcome variable versus 0.2737 without it). At all other missing rates and for the other imputation methods, the MADC was smaller with the outcome variable. In the liver fibrosis data, the MMAPE of random forest imputation and KNN imputation was slightly higher with the outcome variable at a 5% missing rate, at 0.065 and 0.069, respectively; under the other conditions, the MMAPE was smaller after adding the outcome variable. At a 15% missing rate, the MA of multiple imputation with the outcome variable was slightly lower than without it, with values of 0.9468 and 0.9469, respectively, a difference of only 0.0001; under the other conditions, the MA was greater with the outcome variable. At a 5% missing rate, the MADC of KNN imputation with the outcome variable was slightly higher than without it, with values of 0.1079 and 0.0082, respectively; the MADC of random forest imputation without the outcome variable was 0.2662 at a 10% missing rate. In all other cases, the MADC was smaller after adding the outcome variable. Conclusion: In the selection of
imputation methods, considering imputation accuracy, random forest imputation is the best regardless of the missing rate; for estimating the regression coefficients of influencing factors, multiple imputation is recommended regardless of the missing rate. In terms of imputation strategy, including the outcome variable in the imputation model is generally better than excluding it across missing rates and imputation methods. When the missing rate is large, random forest imputation and multiple imputation perform better. For imputation accuracy, random forest imputation with the outcome variable included is recommended, as it performed best. For the analysis of influencing factors, multiple imputation with the outcome variable included is recommended, as it is the most stable and accurate. |
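
The simulation design described in the Methods can be illustrated with a minimal sketch. This is not the authors' code, and it rests on several assumptions: the data are synthetic rather than ILPD, values are deleted completely at random, only a simple KNN imputer is shown (not multiple imputation or random forest), and MMAPE is taken to mean the mean absolute percentage error averaged over the simulated datasets, which the abstract does not define. The function names `mask_mcar`, `knn_impute`, `mape`, and `simulate` are hypothetical.

```python
import math
import random

def mask_mcar(rows, cols, rate, rng):
    """Copy rows and set about `rate` of the cells in `cols` to None, at random."""
    masked = [list(r) for r in rows]
    cells = [(i, j) for i in range(len(rows)) for j in cols]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        masked[i][j] = None
    return masked

def knn_impute(rows, k, use_cols):
    """Fill each None cell with the mean of its k nearest donors.
    Distances use only the columns in use_cols observed in both rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            observed = [r[j] for r in rows if r[j] is not None]
            donors = []
            for other in rows:
                if other is row or other[j] is None:
                    continue
                shared = [c for c in use_cols if c != j
                          and row[c] is not None and other[c] is not None]
                if shared:
                    sq = sum((row[c] - other[c]) ** 2 for c in shared)
                    donors.append((math.sqrt(sq / len(shared)), other[j]))
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]] or observed  # fallback: column mean
            filled[i][j] = sum(nearest) / len(nearest)
    return filled

def mape(truth, filled, masked, cols):
    """Mean absolute percentage error over the cells that were deleted."""
    errs = [abs(filled[i][j] - truth[i][j]) / abs(truth[i][j])
            for i in range(len(truth)) for j in cols if masked[i][j] is None]
    return sum(errs) / len(errs)

def simulate(rate, n_sims=5, n_rows=80, seed=0):
    """Average MAPE over n_sims synthetic datasets, with and without the outcome."""
    rng = random.Random(seed)
    totals = {"with_outcome": 0.0, "without_outcome": 0.0}
    for _sim in range(n_sims):
        truth = []
        for _row in range(n_rows):
            y = rng.randint(0, 1)                 # binary outcome (column 2)
            x1 = 10 + 2 * y + rng.gauss(0, 1)     # predictor related to the outcome
            x2 = 5 + 0.5 * x1 + rng.gauss(0, 1)   # predictor related to x1
            truth.append([x1, x2, y])
        masked = mask_mcar(truth, [0, 1], rate, rng)  # outcome stays fully observed
        for name, use_cols in (("with_outcome", [0, 1, 2]),
                               ("without_outcome", [0, 1])):
            filled = knn_impute(masked, k=5, use_cols=use_cols)
            totals[name] += mape(truth, filled, masked, [0, 1])
    return {name: total / n_sims for name, total in totals.items()}

if __name__ == "__main__":
    for rate in (0.05, 0.15, 0.30):
        print(rate, simulate(rate))
```

Running the script prints, for each missing rate, the averaged error of the two strategies side by side. The only difference between the strategies is whether the outcome column is allowed to contribute to the neighbor distance, which mirrors the with/without comparison in the study; the numbers it produces say nothing about the reported results.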