Understanding and representing multimodal data has long been an important research topic in artificial intelligence. One major line of work models multimodal data with deep probabilistic generative models, and in recent years research built on the variational autoencoder framework has produced fruitful results in this area. However, the inherent characteristics of multimodal data (multiple types, heterogeneity, and redundancy) still leave many modeling problems open. Recent studies have shown that disentangling the shared and private information of multimodal data can effectively improve model inference and data generation, but these approaches may fail to extract the information of each modality accurately. This thesis finds that the alignment and fusion of shared information are the key factors, and therefore incorporates metric learning and self-supervised learning into the modeling. The main research results are as follows:

For the representation and generation of multimodal data, this thesis proposes a self-supervised learning based disentangling multimodal variational autoencoder (SD-MVAE). The model improves disentanglement and representation mainly through three components: 1) a multimodal data generation mechanism that combines shared and private latent vectors; 2) fusion of the shared latent vectors with a product-of-experts function; 3) alignment of the shared latent vectors with a self-supervised triplet loss. Experiments on the MNIST-SVHN and MNIST-CDCB multimodal datasets show that SD-MVAE can effectively disentangle and represent the data. The learned representations significantly improve the accuracy of cross-generation and translation-generation as well as the quality of generated images, and they also improve performance on downstream tasks such as multimodal data classification and cross-modal retrieval.

However, SD-MVAE has a large number of training parameters and has difficulty disentangling and representing the different modalities. To address these problems, this thesis proposes a quadruplet metric loss based multimodal variational autoencoder (Q-MVAE). The model optimizes the network structure and the training objective and introduces a quadruplet metric learning loss, so that with fewer training parameters it achieves performance comparable to SD-MVAE. Experiments on the MNIST-SVHN and CelebA datasets show that Q-MVAE performs well both in data representation and generation and in downstream tasks. Moreover, the model shows potential for finer-grained disentanglement, representation, and generation of multimodal data, which suggests its applicability to image processing.

In summary, to address the representation and generation of multimodal data, this thesis proposes corresponding models and algorithms that incorporate metric learning into the variational autoencoder framework. These results may provide ideas and technical support for processing multimodal data with deep probabilistic generative models.
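
To make the two mechanisms named in the SD-MVAE description more concrete, the following is a minimal PyTorch sketch of product-of-experts fusion of modality-specific Gaussian posteriors and of a triplet loss that aligns shared latent codes across modalities. All function names, the inclusion of a prior expert, and the hyperparameters are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch: product-of-experts fusion and triplet-based alignment of
# shared latent codes. Illustrative only; details of SD-MVAE may differ.
import torch
import torch.nn.functional as F


def product_of_experts(mus, logvars, prior_var=1.0):
    """Fuse per-modality Gaussian posteriors over the shared code into one Gaussian.

    mus, logvars: lists of (batch, dim) tensors, one pair per modality.
    A unit-Gaussian prior expert is included, a common choice in PoE-based
    multimodal VAEs (an assumption here, not stated in the abstract).
    """
    precisions = [torch.exp(-lv) for lv in logvars]
    precisions.append(torch.full_like(mus[0], 1.0 / prior_var))
    means = list(mus) + [torch.zeros_like(mus[0])]

    total_precision = sum(precisions)          # sum of inverse variances
    fused_var = 1.0 / total_precision
    fused_mu = fused_var * sum(p * m for p, m in zip(precisions, means))
    return fused_mu, torch.log(fused_var)


def shared_alignment_triplet_loss(z_a, z_b, margin=1.0):
    """Self-supervised triplet loss on shared codes of two paired modalities.

    For each paired sample, the shared code from modality A is the anchor,
    the code from modality B is the positive, and a mismatched (shuffled)
    code from modality B serves as the negative.
    """
    negatives = z_b[torch.randperm(z_b.size(0))]
    return F.triplet_margin_loss(z_a, z_b, negatives, margin=margin)
```

In a training loop under these assumptions, each modality encoder would output its own (mu, logvar) for the shared code; `product_of_experts` fuses them before sampling, and `shared_alignment_triplet_loss` is added to the VAE objective to pull the per-modality shared codes of the same sample together while pushing mismatched samples apart.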