With the advent of the big data era and the development of data science, academic researchers and data practitioners pay increasing attention to the value of data, and research increasingly seeks more comprehensive data. At the same time, messy data, complex data structures, and missing information have become more and more common, which poses great challenges for data mining. Handling missing data is an important part of data preprocessing. Imputation is one of the most common techniques for handling missing data because it preserves as much of the original information as possible and makes incomplete data complete. However, existing imputation methods have many limitations: some apply only when data are missing completely at random, and others require complete samples for model training. Such strict conditions leave these methods with a very narrow scope of application.

This paper discusses missing mechanisms and missing patterns. Three missing mechanisms, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), are each simulated in two ways, uniform and column-non-uniform, and a general missing pattern is simulated for the sake of generality. The original complete data are standardized and preprocessed, and missing values are then simulated. To keep the evaluation metrics consistent, this paper selects four classification datasets from the UCI Machine Learning Repository, including datasets with only numeric variables and datasets that mix numeric and categorical variables, which together represent most data structures. In addition, data simulated from a multivariate normal distribution are used to analyze reconstruction error. The root mean square error (RMSE) and, for categorical variables, the proportion of falsely classified entries (PFC) are used to measure reconstruction error, and the predictive ability of the reconstructed data is measured by the AUROC of a logistic regression.

So that it applies to different missing mechanisms and data structures, the algorithm proposed in this paper is unsupervised, which avoids the need for complete data for training. The algorithm applies the idea of boosting: an initial imputation is produced with an unsupervised machine learning method, and the result is then imputed further with Generative Adversarial Imputation Nets (GAIN), a deep learning method. Standard GAIN, whose initial imputation fills missing entries with 0 and which is denoted zero-GAIN here for consistent notation, is also unsupervised, so the stacked method remains unsupervised. This paper calls the method boosted generative adversarial imputation, abbreviated Boosting-GAIN. To explore the influence of the initial imputation, mean imputation, k-nearest-neighbor imputation, and MissForest imputation are each used as the initial imputation and compared with zero-GAIN.

On both the real and the simulated datasets, Boosting-GAIN outperforms zero-GAIN under the various missing mechanisms, both in reconstructing the data and in the predictive ability of the reconstructed data. Among the variants, Boosting-GAIN with MissForest as the initial imputation (MissForest-GAIN) performs best and is the algorithm recommended in this paper. In addition, every algorithm in the experiments shows a clear trend: as the missing rate increases, the reconstruction error increases and the predictive ability decreases.
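
The evaluation protocol described above (simulate complete data, standardize, delete entries, impute, score) can be illustrated with a minimal sketch. It is not the paper's implementation: it covers only uniform MCAR deletion and uses mean imputation, the simplest of the three initial imputers, with no GAIN stage. The function names `mcar_mask`, `mean_impute`, `rmse`, and `pfc` are hypothetical, and scoring only the deleted entries is an assumption about how the errors are computed.

```python
import numpy as np


def mcar_mask(shape, missing_rate, rng):
    """Boolean mask; True marks an entry deleted completely at random (uniform MCAR)."""
    return rng.random(shape) < missing_rate


def mean_impute(x_missing):
    """Fill the NaNs in each column with that column's observed mean."""
    col_means = np.nanmean(x_missing, axis=0)
    return np.where(np.isnan(x_missing), col_means, x_missing)


def rmse(x_true, x_imputed, mask):
    """Root mean square error, computed over the deleted entries only (assumption)."""
    diff = x_true[mask] - x_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def pfc(c_true, c_imputed, mask):
    """Proportion of falsely classified entries among the deleted categorical entries."""
    return float(np.mean(c_true[mask] != c_imputed[mask]))


rng = np.random.default_rng(0)

# Complete data simulated from a multivariate normal distribution
# (the covariance matrix is an arbitrary illustrative choice).
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
x = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=500)

# Standardize each column, then simulate missing values at a 20% missing rate.
x = (x - x.mean(axis=0)) / x.std(axis=0)
mask = mcar_mask(x.shape, 0.2, rng)
x_missing = np.where(mask, np.nan, x)

# Impute and score the reconstruction on the deleted entries.
x_hat = mean_impute(x_missing)
print("RMSE of mean imputation:", rmse(x, x_hat, mask))

# PFC on a tiny hand-made categorical example.
c_true = np.array(["a", "b", "a", "c"])
c_imputed = np.array(["a", "a", "a", "b"])
c_mask = np.array([True, True, False, True])
print("PFC:", pfc(c_true, c_imputed, c_mask))
```

In a Boosting-GAIN style pipeline, the output of `mean_impute` (or of a k-nearest-neighbor or MissForest imputer) would replace the zero fill as GAIN's starting point. MAR and MNAR masks would additionally make the deletion probability depend on observed or unobserved values, which this sketch does not attempt.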