
Adversarial Multimodal Network For Video Question Answering

Posted on: 2022-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Sun
Full Text: PDF
GTID: 2518306524980119
Subject: Computer Science and Technology
Abstract/Summary:
With the continuous enrichment of media content, visual question answering, which draws on information from multiple modalities, has attracted increasing attention in recent years. Jointly understanding visual and textual scenes remains very challenging: visual content and natural language have quite different statistical properties, and each modality exhibits continuity within the same content and correlation across different content. In the video question answering task in particular, the difficulty of data processing and the richness of the content grow substantially, so understanding video together with text poses an even greater challenge for researchers. The main research contents of this thesis are as follows:

1. The focus of this thesis is to design a new model for the multimodal video question answering task, so as to better understand questions and answers about video stories. To support the subsequent research, the fundamentals of visual question answering and deep neural networks are first introduced.

2. To solve the fusion of multiple modalities in video question answering efficiently, we develop a model with higher accuracy and lower complexity. We first study memory networks designed for text question answering, and improve them with ideas from adversarial networks and the attention mechanism. We propose to learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions) based on generative adversarial networks. Moreover, a self-attention mechanism enforces a newly introduced consistency constraint that preserves the self-correlation among the visual cues of the original video clips in the learned multimodal representations, keeping the multimodal features consistent. Extensive experiments on the benchmark MovieQA dataset show the effectiveness of the proposed AMN over other published state-of-the-art methods. To the best of our knowledge, AMN is the first work to introduce the generative adversarial framework to multimodal question answering. (Minimal sketches of the adversarial alignment and of the consistency constraint follow this list.)

3. To make the model effective on story plots with longer context intervals while still solving the multimodal fusion problem, and to obtain higher accuracy on questions that depend on contextual details, we study the Two-Stream network used in video classification tasks, optimize its backbone with multi-channel convolution, and place the adversarial learning module of AMN inside the Two-Stream network. Extensive experiments on the TVQA benchmark show the effectiveness of the proposed AMN network. (A two-stream sketch also follows this list.)

4. To explore the temporal correlation between the visual and text features in the proposed AMN method, we use PCA and t-SNE to analyze the modal feature distributions; we also use confusion matrices and other data visualization methods to examine the multimodality inside the AMN network. (A sketch of this analysis closes the list.)
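The following is a minimal PyTorch sketch of the adversarial subspace idea described in point 2: two encoders project video and text features into a shared space, a discriminator is trained to tell the modalities apart, and the encoders learn to fool it. All module names and dimensions are illustrative assumptions, not the thesis's actual AMN implementation; optimizers and the QA head are omitted.

    # Sketch: adversarial alignment of video and text features in a shared subspace.
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, in_dim, shared_dim=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, shared_dim), nn.ReLU(),
                                     nn.Linear(shared_dim, shared_dim))
        def forward(self, x):
            return self.net(x)

    class ModalityDiscriminator(nn.Module):
        """Predicts whether a shared-space feature came from video (1) or text (0)."""
        def __init__(self, shared_dim=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(shared_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 1))
        def forward(self, z):
            return self.net(z)

    video_enc, text_enc = Encoder(2048), Encoder(768)  # hypothetical feature dims
    disc = ModalityDiscriminator()
    bce = nn.BCEWithLogitsLoss()

    video_feat = torch.randn(8, 2048)   # e.g. clip-level CNN features
    text_feat = torch.randn(8, 768)     # e.g. subtitle/question embeddings
    zv, zt = video_enc(video_feat), text_enc(text_feat)

    # Discriminator step: classify the true modality of each (detached) feature.
    d_loss = bce(disc(zv.detach()), torch.ones(8, 1)) + \
             bce(disc(zt.detach()), torch.zeros(8, 1))

    # Encoder (generator) step: fool the discriminator so the two modalities
    # become indistinguishable in the learned shared subspace.
    g_loss = bce(disc(zv), torch.zeros(8, 1)) + bce(disc(zt), torch.ones(8, 1))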
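The self-correlation consistency constraint of point 2 can be sketched as follows: compute frame-to-frame similarity (self-attention) matrices for the original video features and for their learned shared-space counterparts, and penalize the difference so the learned subspace keeps the temporal structure of the original clip. This is an illustrative reading of the constraint, not the thesis's exact formulation.

    # Sketch: consistency loss between self-correlation maps of original and learned features.
    import torch
    import torch.nn.functional as F

    def self_correlation(feats):
        """Row-normalized pairwise similarity over time steps: (T, D) -> (T, T)."""
        feats = F.normalize(feats, dim=-1)
        return F.softmax(feats @ feats.t(), dim=-1)

    original = torch.randn(20, 2048)   # 20 frames of raw visual features
    learned = torch.randn(20, 256)     # the same frames in the shared subspace

    consistency_loss = F.mse_loss(self_correlation(learned),
                                  self_correlation(original))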
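For point 3, the sketch below shows a generic two-stream layout: an RGB (appearance) stream and an optical-flow (motion) stream are encoded separately and fused late; in the thesis the AMN adversarial module sits on top of such a backbone. The layer sizes and fusion scheme here are assumptions for illustration only.

    # Sketch: two-stream backbone with late fusion of appearance and motion.
    import torch
    import torch.nn as nn

    class Stream(nn.Module):
        def __init__(self, in_ch, out_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.fc = nn.Linear(64, out_dim)
        def forward(self, x):
            return self.fc(self.conv(x).flatten(1))

    rgb_stream = Stream(in_ch=3)    # appearance from single RGB frames
    flow_stream = Stream(in_ch=2)   # motion from optical flow (x, y channels)

    rgb = torch.randn(8, 3, 112, 112)
    flow = torch.randn(8, 2, 112, 112)
    fused = torch.cat([rgb_stream(rgb), flow_stream(flow)], dim=1)  # late fusion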
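Finally, the feature-distribution analysis of point 4 can be reproduced with standard tools: reduce the high-dimensional modal features with PCA, embed them in 2-D with t-SNE, and plot the two modalities. The data below is synthetic; only the analysis pipeline reflects the abstract.

    # Sketch: PCA + t-SNE visualization of video vs. text feature distributions.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    video_feats = np.random.randn(200, 256)        # placeholder video features
    text_feats = np.random.randn(200, 256) + 1.0   # placeholder text features
    feats = np.vstack([video_feats, text_feats])
    labels = np.array([0] * 200 + [1] * 200)       # 0 = video, 1 = text

    # PCA first to denoise and reduce dimension, then t-SNE for the 2-D map.
    reduced = PCA(n_components=50).fit_transform(feats)
    embedded = TSNE(n_components=2, perplexity=30).fit_transform(reduced)

    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="coolwarm", s=8)
    plt.title("Modal feature distribution (t-SNE)")
    plt.show()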
Keywords/Search Tags: Movie Question Answering, Adversarial Network, Multimodal Understanding, Attention Mechanism