| With the advent of the data era, acquiring massive amounts of data is no longer a problem, but improving the quality of the acquired data has become increasingly important. High-quality data is a prerequisite for training high-quality models. During data acquisition, subjective or objective factors inevitably lead to some degree of missing data, which substantially degrades data quality. Common methods for handling missing data currently include direct deletion, imputation based on probabilistic models, and imputation based on traditional machine learning. With the development of artificial intelligence, researchers have begun to use generative models based on deep neural networks to handle missing data. Variational autoencoders (VAEs), along with other generative models, have been shown to capture the underlying structure of large volumes of complex, high-dimensional data efficiently and accurately. However, existing VAEs still cannot directly handle datasets with missing values. In this paper, we propose a VAE-based generative model built on a multi-head self-attention mechanism that can impute missing values in incomplete data whose attributes have mixed data types. The proposed model maps categorical and continuous values separately into a high-dimensional space: each single value is mapped to a high-dimensional vector, and a multi-head self-attention module operating in that space captures the relationships between the vectors. In the decoder, these relationships are used to restore the original values and thereby estimate the missing part of the data. To verify the effectiveness of the model, we select four classification datasets with attributes of different data types and simulate missing data on them. The simulated missing values are then imputed on the four datasets using traditional imputation methods and the proposed generative imputation method. Comparative experiments show that the proposed method achieves better imputation performance than existing methods on most datasets. Further experiments show that the proposed generative model outperforms the partially supervised classification model on a binary-classification label prediction task. |
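
The per-value embedding and attention step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sizes (`D_MODEL`, `N_HEADS`), the toy schema, the learned `mask_token` for missing entries, the masking of missing features as attention keys, and the random weights standing in for trained parameters are all assumptions made for illustration. The VAE latent sampling and the decoder are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_HEADS = 8, 2      # assumed sizes, not taken from the paper
N_CONT, CAT_SIZES = 2, [3]   # toy schema: 2 continuous + 1 categorical feature

# Randomly initialised parameters stand in for trained weights (assumption).
cont_w = rng.normal(size=(N_CONT, D_MODEL))
cont_b = rng.normal(size=(N_CONT, D_MODEL))
cat_tables = [rng.normal(size=(k, D_MODEL)) for k in CAT_SIZES]
mask_token = rng.normal(size=D_MODEL)  # hypothetical placeholder for missing values
wq, wk, wv, wo = (rng.normal(size=(D_MODEL, D_MODEL)) for _ in range(4))


def embed_row(cont_values, cat_values, observed):
    """Map every scalar feature of one row to its own D_MODEL-dimensional vector."""
    # Continuous feature j: per-feature linear map x_j * w_j + b_j.
    cont_tokens = cont_values[:, None] * cont_w + cont_b
    # Categorical feature k: row lookup in that feature's embedding table.
    cat_tokens = np.stack([tbl[v] for tbl, v in zip(cat_tables, cat_values)])
    tokens = np.concatenate([cont_tokens, cat_tokens], axis=0)
    # Missing features are replaced by the shared mask token.
    return np.where(observed[:, None], tokens, mask_token)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, observed):
    """Scaled dot-product self-attention over the feature vectors of one row."""
    n, d = tokens.shape
    d_head = d // N_HEADS

    def split(w):
        # Project, then split the last axis into heads: (N_HEADS, n, d_head).
        return (tokens @ w).reshape(n, N_HEADS, d_head).transpose(1, 0, 2)

    q, k, v = split(wq), split(wk), split(wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (N_HEADS, n, n)
    # Missing features are excluded as keys, so no feature attends to them.
    scores = np.where(observed[None, None, :], scores, -1e9)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (N_HEADS, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)        # concatenate the heads
    return merged @ wo, weights


# One row whose second continuous feature is missing (placeholder value 0.0).
observed = np.array([True, False, True])
tokens = embed_row(np.array([0.7, 0.0]), [2], observed)
out, weights = multi_head_self_attention(tokens, observed)
print(out.shape, weights.shape)
```

In this sketch the missing feature still acts as a query, so its output row is a mixture of the observed features only; a decoder could map that row back to an estimate of the missing value.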