
Comparative Study On Imputation Methods Of Missing Data In XGBOOST Model Under Complete Random Missing Mechanism

Posted on: 2022-10-31 | Degree: Master | Type: Thesis
Country: China | Candidate: F Fang | Full Text: PDF
GTID: 2518306335454664 | Subject: Art and design
Abstract/Summary:
XGBoost (Extreme Gradient Boosting) is a classification algorithm with high classification accuracy and fast computation, and it has been applied successfully in many fields. In practice, however, missing data are common; they reduce the information available in the sample and affect classification results. At present, no research has shown how missing data affect the classification performance of this model, which is an important question for applications of XGBoost. This thesis studies the influence of data missing completely at random (MCAR) on the classification accuracy of the XGBoost model. By comparing several imputation methods, the best-performing method is identified and then improved.

The work is as follows. After introducing the basic principles of the XGBoost algorithm, nine imputation methods are investigated: mean imputation, regression imputation, EM algorithm imputation, multiple imputation, K nearest neighbor imputation, neural network imputation, decision tree imputation, random forest imputation and missing forest imputation. Nine medical data sets (3 continuous, 3 discrete and 3 mixed) were selected, and incomplete versions with missing rates of 5%, 10%, 20% and 30% were generated at random under an arbitrary missingness pattern using an R program. XGBoost classification models were fitted to these data sets and evaluated by three metrics: accuracy, F1 score and AUC. Missing data were found to have a clear effect on the classification accuracy of the XGBoost model. XGBoost models were then fitted again after imputing the incomplete data sets with each of the nine methods, and their classification performance was compared.

The conclusions are as follows:
1. Missing data do affect the classification accuracy of the XGBoost model.
2. At a missing rate of 5%, the imputation methods perform similarly; as the missing rate increases, the differences between methods grow and imputation accuracy tends to decline.
3. Missing forest imputation generally performs better than the other methods. At higher missing rates (> 20%), however, it performs well only on mixed data sets (containing both continuous and discrete variables) and is not the best method for purely continuous or purely discrete data.
4. To further improve missing forest imputation, a new method that combines K nearest neighbor imputation with missing forest imputation is proposed: the MF-KNN imputation method. Experiments show that it imputes better on all types of data sets at all missing rates tested, especially the higher ones.
5. In theory, parameter estimates of a statistical model are unbiased under MCAR, which is therefore generally regarded as ignorable missingness. The experiments in this thesis nevertheless show that MCAR data cannot be ignored in specific applications.
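The experimental setup described above (generate MCAR missingness at several rates, impute, then measure the effect) can be illustrated with a minimal sketch. This is not the thesis's code: the thesis used an R program, whereas this example is plain Python, and the function names `make_mcar`, `mean_impute` and `imputation_rmse`, the toy Gaussian data, and the choice to mask each cell independently are all assumptions made for illustration. Only mean imputation, the simplest of the nine methods, is shown; the XGBoost classification step and the MF-KNN method are not reproduced here.

```python
import math
import random


def make_mcar(rows, rate, seed=0):
    """Return a copy of `rows` in which each cell is independently replaced
    by None with probability `rate` (an assumed cell-wise MCAR scheme)."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def mean_impute(rows):
    """Fill every None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Arbitrary fallback of 0.0 if a column has no observed values.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def imputation_rmse(truth, masked, imputed):
    """Root mean squared error computed over the masked cells only."""
    sq_errors = [
        (t - i) ** 2
        for t_row, m_row, i_row in zip(truth, masked, imputed)
        for t, m, i in zip(t_row, m_row, i_row)
        if m is None
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors)) if sq_errors else 0.0


# Toy continuous data set: 200 rows, 3 features (hypothetical values).
data_rng = random.Random(42)
data = [
    [data_rng.gauss(0, 1), data_rng.gauss(5, 2), data_rng.gauss(-3, 0.5)]
    for _ in range(200)
]

# The four missing rates used in the thesis.
results = {}
for rate in (0.05, 0.10, 0.20, 0.30):
    masked = make_mcar(data, rate, seed=1)
    imputed = mean_impute(masked)
    results[rate] = imputation_rmse(data, masked, imputed)
    print(f"missing rate {rate:.0%}: imputation RMSE = {results[rate]:.3f}")
```

In a fuller comparison, each imputed data set would be passed to a classifier and scored by accuracy, F1 score and AUC, as the abstract describes; the RMSE on masked cells used here is only a stand-in for that downstream evaluation.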
Keywords/Search Tags: XGBoost model, Missing completely at random, Missing forest imputation, MF-KNN imputation