
Image And Text Joint Modeling Method Based On Multimodal Weibull Variational Auto-Encoder

Posted on: 2021-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: S C Xiao
Full Text: PDF
GTID: 2518306050966909
Subject: Signal and Information Processing
Abstract/Summary:
With the continuous development of internet technology, information increasingly spreads through data from multiple modalities. Data from these modalities are usually highly correlated, yet each modality also carries unique information. In this context, learning from a single modality alone can no longer meet practical needs, since it often leads to problems such as information loss. Joint learning on related multimodal data, by contrast, can not only mine the semantic relationships between modalities but also let the information carried by each modality complement the others, yielding more comprehensive information. Multimodal learning has therefore become a hot topic in artificial intelligence, and joint modeling of image-text data is especially popular in industry. This thesis focuses on multimodal joint modeling of image-text data: it proposes a Multimodal Weibull Variational Auto-Encoder and then improves and extends the model. The main content is as follows.

The first part analyzes and summarizes existing multimodal data modeling methods based on probabilistic generative models and theoretically examines their advantages and disadvantages. It first discusses multimodal modeling methods based on shallow probabilistic topic models and on neural networks, analyzing the strengths and weaknesses of both, and then discusses the Multimodal Poisson Gamma Belief Network, a deep probabilistic topic model. This model makes up for the shortcomings of the two previous approaches, but it also has limitations, which are the research focus of this thesis.

The second part introduces the Multimodal Weibull Variational Auto-Encoder proposed in this thesis. Compared with shallow probabilistic topic models and neural-network-based multimodal modeling methods, the Multimodal Poisson Gamma Belief Network can extract hierarchical latent representations that are easy to interpret, but it also has limitations: real-time prediction is difficult, and supervision information and auxiliary information are hard to incorporate. To retain the advantages of the Multimodal Poisson Gamma Belief Network while making up for these shortcomings, this thesis proposes the Multimodal Weibull Variational Auto-Encoder, which uses the Multimodal Poisson Gamma Belief Network as the decoder and an inference network as the encoder. The inference network reparameterizes a Weibull variational distribution and uses this Weibull variational posterior to approximate the true posterior of the model's latent representation, so that input data can be mapped directly to the latent representation at test time, realizing real-time prediction. Because the model uses a network mapping, supervision information and auxiliary information are also easy to add. Further, this thesis uses a convolutional neural network to extract a global feature from each image and adds it to the model as auxiliary information to improve multimodal joint classification performance. Finally, experiments on a variety of datasets show that the Multimodal Weibull Variational Auto-Encoder can perform real-time prediction and achieves better performance thanks to its ability to incorporate supervision information and auxiliary information. In addition, the model can visualize the hierarchical relationships among the modalities through topics.
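For illustration, the following is a minimal sketch of the Weibull reparameterization described above, assuming an inference network that maps input features to the shape and scale of a Weibull variational distribution; the layer sizes, variable names, and overall architecture here are assumptions for the example, not the thesis's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeibullEncoder(nn.Module):
    # Sketch of an inference network: input features -> Weibull shape k and scale lam,
    # followed by a reparameterized sample of the non-negative latent representation.
    def __init__(self, in_dim, topic_dim, hidden_dim=256):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.k_layer = nn.Linear(hidden_dim, topic_dim)    # Weibull shape
        self.lam_layer = nn.Linear(hidden_dim, topic_dim)  # Weibull scale

    def forward(self, x):
        h = self.hidden(x)
        # softplus keeps both Weibull parameters strictly positive
        k = F.softplus(self.k_layer(h)) + 1e-6
        lam = F.softplus(self.lam_layer(h)) + 1e-6
        # Weibull reparameterization: theta = lam * (-log(1 - u))**(1/k), u ~ Uniform(0, 1),
        # so gradients can flow through the sample back to the encoder parameters.
        u = torch.rand_like(k)
        theta = lam * (-torch.log(1.0 - u + 1e-8)) ** (1.0 / k)
        return theta, k, lam

In the model described above, such a sample would be fed to the Multimodal Poisson Gamma Belief Network decoder, and the closed-form KL divergence between the Weibull variational posterior and the Gamma prior would enter the training objective.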
The third part first introduces the Attentional Multimodal Aligned Model. This model extracts rich, detailed image-text joint features, which are then added as auxiliary information to the Multimodal Weibull Variational Auto-Encoder. This resolves the two limitations of using only the global image feature as auxiliary information: 1) lack of text information; 2) lack of rich detailed information. Experiments further show that the model achieves state-of-the-art multimodal joint classification performance. In addition, this thesis visually analyzes the attention relationships between image sub-regions and text words learned by the model.
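As an illustration of the kind of image-region/word attention analyzed above, the sketch below computes a simple cross-modal attention between pre-extracted image region features and word embeddings. It is a generic example under assumed feature shapes, not the thesis's Attentional Multimodal Aligned Model.

import torch
import torch.nn.functional as F

def cross_modal_attention(regions, words):
    # regions: (num_regions, dim) image sub-region features
    # words:   (num_words, dim)   text word embeddings
    dim = regions.size(-1)
    scores = words @ regions.t() / dim ** 0.5      # (num_words, num_regions)
    weights = F.softmax(scores, dim=-1)            # each word attends over image regions
    attended = weights @ regions                   # region context aligned with each word
    joint = torch.cat([words, attended], dim=-1)   # simple image-text joint feature
    return joint, weights

# Example with random features: 36 regions and 12 words, both 256-dimensional.
regions = torch.randn(36, 256)
words = torch.randn(12, 256)
joint, weights = cross_modal_attention(regions, words)

The attention weights returned by such a function are what can be visualized to inspect which image sub-regions each text word aligns with.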
Keywords/Search Tags: Multimodal Data Modeling, Deep Probabilistic Topic Model, Multimodal Weibull Variational Auto-Encoder, Real-Time Prediction, Multimodal Joint Classification