
Research On Visual Question Answering Based On Modal Interaction

Posted on: 2024-01-04    Degree: Master    Type: Thesis
Country: China    Candidate: Z Lei    Full Text: PDF
GTID: 2568307061491824    Subject: Software engineering
Abstract/Summary:
Visual Question Answering (VQA) aims to enable computers to automatically answer natural language questions about images based on an understanding of both the image and the language content. It promotes human-computer interaction and has long-term research prospects and great application value. Among existing VQA methods, traditional joint feature embedding directly concatenates the features of the two modalities and lacks inter-modal information interaction, which leads to unsatisfactory results. In recent years, attention-based VQA methods have achieved promising results. However, the traditional self-attention mechanism relies entirely on pairwise similarity when exploring relations between objects within a single modality, so it cannot provide prior knowledge to help the model understand images and questions when answering questions that do not appear in the dataset. Moreover, existing attention-based methods use question features of only a single semantic dimension to guide the feature interaction process, lacking multi-dimensional semantic information; in addition, feature information is insufficiently fused in the fusion stage and contains redundancy. From the perspective of multimodal interaction, this paper proposes new attention-based methods to improve answer accuracy on VQA tasks. The main research work is as follows:

(1) A visual question answering method based on prior knowledge augmentation and gated interaction attention is proposed. To address the lack of prior knowledge when the traditional self-attention mechanism mines relations between objects within a single modality, a prior-knowledge-enhanced attention module is constructed: a prior knowledge vector is embedded into the self-attention mechanism, introducing prior knowledge into the single-modal information mining stage. To address information redundancy in joint attention, a gated interaction attention module is constructed in the multimodal interaction stage to carry out the information interaction between modalities and to refine and integrate the interaction features. In addition, the method designs a two-stream fusion module to complete multimodal feature fusion in the fusion stage.

(2) An encoder-decoder visual question answering architecture based on a multi-level mesh interaction model is proposed. To further improve answer accuracy, this method connects interaction modules in a multi-level mesh structure and uses low-dimensional and high-dimensional question features from different levels to provide question information of more semantic dimensions for modal interaction. The interaction attention module proposed for this method completes the dense interaction between image features and question features within a single module. Considering that different questions about the same image focus on different objects, the method designs an adaptive multi-scale fusion module in the feature fusion stage to aggregate fusion features from different scales.

Extensive experiments are carried out on two widely used large-scale VQA datasets, and the results verify the effectiveness of the proposed models.
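The abstract describes two mechanisms in prose only: injecting prior knowledge into single-modal self-attention, and gating the cross-modal interaction to filter redundancy. The following is a minimal PyTorch sketch of how such modules could look, not the thesis implementation; the module names, dimensions, number of prior-knowledge slots, and the specific way the prior enters the attention computation are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PriorKnowledgeSelfAttention(nn.Module):
    """Self-attention over one modality with learnable prior-knowledge slots
    appended to the keys/values, so attention is not driven by pairwise
    feature similarity alone (hypothetical formulation of the idea above)."""

    def __init__(self, dim: int, num_heads: int = 8, num_prior: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical learnable prior-knowledge vectors shared across samples.
        self.prior = nn.Parameter(torch.randn(num_prior, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prior = self.prior.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([x, prior], dim=1)  # object features + prior slots
        out, _ = self.attn(x, kv, kv)
        return out


class GatedInteractionAttention(nn.Module):
    """Cross-modal attention (question guides image regions) followed by a
    sigmoid gate that suppresses redundant interaction features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, img: torch.Tensor, ques: torch.Tensor) -> torch.Tensor:
        # Image regions attend to question words.
        inter, _ = self.cross(img, ques, ques)
        # Per-dimension gate decides how much interaction signal to keep.
        g = torch.sigmoid(self.gate(torch.cat([img, inter], dim=-1)))
        return g * inter + (1 - g) * img


if __name__ == "__main__":
    img = torch.randn(2, 36, 512)   # 36 region features per image (assumed)
    ques = torch.randn(2, 14, 512)  # 14 word features per question (assumed)
    img = PriorKnowledgeSelfAttention(512)(img)
    fused = GatedInteractionAttention(512)(img, ques)
    print(fused.shape)  # torch.Size([2, 36, 512])
```

In this sketch the prior knowledge is modeled as extra key/value slots and the gate as a learned convex combination of the original and interacted features; the thesis may realize both ideas differently.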
Keywords/Search Tags: Visual Question Answering, Modal Interaction, Multimodal Fusion, Attention Mechanism