Comparative Study On Imputation Methods Of Missing Data In XGBOOST Model Under Complete Random Missing Mechanism

Posted on:2022-10-31

Degree:Master

Type:Thesis

Country:China

Candidate:F Fang

Full Text:PDF

GTID:2518306335454664

Subject:Art and design

Abstract/Summary:

XGBOOST (Extreme Gradient Boosting) is a classification algorithm with high accuracy and fast computation that has been widely and successfully applied in many fields. In practice, however, missing data are common; they reduce the information available in a sample and affect classification results. At present, no research has shown how missing data affect the classification performance of this model, which is an important question for applications of XGBOOST. This thesis studies the influence of data missing completely at random (MCAR) on the classification accuracy of the XGBOOST model. By comparing the performance of several imputation methods, the best-performing method is identified and then improved.

The work of this thesis is as follows. After introducing the basic principle of the XGBOOST algorithm, nine imputation methods are investigated: mean imputation, regression imputation, EM algorithm imputation, multiple imputation, K-nearest-neighbor (KNN) imputation, neural network imputation, decision tree imputation, random forest imputation, and missing forest imputation. Nine medical data sets (3 continuous, 3 discrete, and 3 mixed) were selected, and incomplete data sets with missing rates of 5%, 10%, 20%, and 30% were randomly generated under an arbitrary missing pattern by an R program. An XGBOOST classification model was established, and its classification performance was evaluated with three metrics: accuracy, F1 score, and AUC. Missing data were found to have an obvious influence on the classification accuracy of the XGBOOST model. XGBOOST models were then built on the data sets imputed by each of the nine methods, and their classification performance was compared and evaluated.

The conclusions are as follows:

1. Missing data do affect the classification accuracy of the XGBOOST model.
2. At a missing rate of 5%, all imputation methods perform similarly. As the missing rate increases, the differences between methods gradually widen and imputation accuracy tends to decline.
3. Missing forest imputation generally performs better than the other methods. At higher missing rates (> 20%), however, it performs well only on mixed data sets (containing both continuous and discrete data) and is not the best method for purely continuous or purely discrete data.
4. To further improve missing forest imputation, a new method that integrates K-nearest-neighbor imputation with missing forest imputation is proposed: the MF-KNN imputation method. Experiments show that it imputes better on all types of data sets at different missing rates, especially the higher ones.
5. In theory, parameter estimates of a statistical model are unbiased when data are missing completely at random, so this kind of missingness is generally regarded as ignorable. The experiments in this thesis show, however, that data missing completely at random cannot be ignored in specific applications.
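The data-generation step described in the abstract can be illustrated with a small sketch. The thesis states only that an R program produced the incomplete data sets, so everything below is an assumption made for illustration: the language (Python, standard library only), the toy data, and the hypothetical helper names `make_mcar` and `mean_impute`. It masks cells uniformly at random at the four missing rates and applies mean imputation, the simplest of the nine methods, as a baseline.

```python
import random

def make_mcar(rows, missing_rate, seed=0):
    """Return a copy of `rows` with a fraction `missing_rate` of cells set to None.

    Cells are drawn uniformly at random, independently of their values,
    which is the defining property of missing completely at random (MCAR).
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    n_missing = round(missing_rate * len(cells))
    masked = [list(r) for r in rows]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked

def mean_impute(rows):
    """Fill each None with the mean of the observed values in its column.

    A column with no observed values is filled with 0.0 (an arbitrary choice).
    """
    n_cols = len(rows[0])
    filled = [list(r) for r in rows]
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_mean = sum(observed) / len(observed) if observed else 0.0
        for r in filled:
            if r[j] is None:
                r[j] = col_mean
    return filled

# Hypothetical toy data: 20 rows, 5 continuous features.
data_rng = random.Random(42)
complete = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]

# The four missing rates used in the thesis.
for rate in (0.05, 0.10, 0.20, 0.30):
    incomplete = make_mcar(complete, rate, seed=1)
    imputed = mean_impute(incomplete)
    n_missing = sum(v is None for row in incomplete for v in row)
    print(f"rate={rate:.2f} missing_cells={n_missing}")
```

Sampling cells without replacement makes the realized missing rate match the target exactly. Drawing each cell independently with probability equal to the rate would also be MCAR, but the realized rate would vary from run to run. In a comparison like the one in the thesis, each imputed data set would then be passed to the classifier and scored on accuracy, F1, and AUC.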

Keywords/Search Tags:

XGBOOST model, Missing completely at random, Missing forest imputation, MF-KNN imputation

Related items

1	Nonparametric Imputation For Missing Data
2	Studies On Missing Data Imputation
3	The Analysis And Improvement Research Of Knn-imputation Algorithm
4	Research On Strategy Of Imputing Missing Data Based On Random Forest
5	Attribute Associated Neuron Modeling And Missing Value Imputation Based On Neural Network
6	Research On Missing Value Imputation Of Incomplete Data
7	Research On Adaptive And Robust Missing Value Imputation Algorithm
8	The Online Imputation Method Of Missing Value Based On KNN And Its Application In Credit Evaluation
9	Research On Missing Value Imputation Method Based On Mixed Information System
10	Missing Value Imputation Based On TS Modeling With Alternate Learning