
Research On Visual Question Answering Based On Modal Interaction

Posted on: 2024-01-04    Degree: Master    Type: Thesis
Country: China    Candidate: Z Lei    Full Text: PDF
GTID: 2568307061491824    Subject: Software engineering
Abstract/Summary:
Visual Question Answering (VQA) aims to enable computers to automatically answer natural language questions about images based on an understanding of both the image and the language content. It promotes human-computer interaction and has long-term research prospects and great application value. Among existing VQA methods, traditional joint feature embedding directly concatenates the features of the two modalities and lacks inter-modal information interaction, which leads to unsatisfactory results. In recent years, attention-based VQA methods have achieved promising results. However, the traditional self-attention mechanism relies entirely on pairwise similarity when exploring relations between objects within a single modality, so it cannot provide prior knowledge to help the model understand images and questions when answering questions that do not appear in the dataset. Moreover, existing attention-based methods use question features of only a single semantic dimension to guide the feature interaction process, lacking multi-dimensional semantic information; in addition, feature information is insufficiently fused in the fusion stage and contains redundancy. From the perspective of multimodal interaction, this paper proposes new attention-based methods to improve answer accuracy on VQA tasks. The main research work is as follows:

(1) A visual question answering method based on prior knowledge augmentation and gated interaction attention is proposed. To address the lack of prior knowledge when the traditional self-attention mechanism mines relations between objects within a single modality, a prior-knowledge-enhanced attention module is constructed: a prior knowledge vector is embedded into the self-attention mechanism, introducing prior knowledge into the single-modal information mining stage. To address information redundancy in joint attention, a gated interaction attention module is constructed in the multimodal interaction stage to carry out the information interaction between modalities and to refine and integrate the interaction features. In addition, the method designs a two-stream fusion module to complete multimodal feature fusion in the fusion stage.

(2) An encoder-decoder visual question answering architecture based on a multi-level mesh interaction model is proposed. To further improve answer accuracy, this method connects interaction modules in a multi-level mesh structure and uses low-dimensional and high-dimensional question features from different levels to provide question information of more semantic dimensions for modal interaction. The interaction attention module proposed for this method completes the dense interaction between image features and question features within a single module. Considering that different questions about the same image focus on different objects, the method designs an adaptive multi-scale fusion module in the feature fusion stage to aggregate fusion features from different scales.

Extensive experiments are carried out on two widely used large-scale VQA datasets, and the results verify the effectiveness of the proposed models.
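The abstract describes two mechanisms in prose only: injecting prior knowledge into single-modal self-attention, and gating the cross-modal interaction to filter redundancy. The following is a minimal PyTorch sketch of how such modules could look, not the thesis implementation; the module names, dimensions, number of prior-knowledge slots, and the specific way the prior enters the attention computation are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PriorKnowledgeSelfAttention(nn.Module):
    """Self-attention over one modality with learnable prior-knowledge slots
    appended to the keys/values, so attention is not driven by pairwise
    feature similarity alone (hypothetical formulation of the idea above)."""

    def __init__(self, dim: int, num_heads: int = 8, num_prior: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical learnable prior-knowledge vectors shared across samples.
        self.prior = nn.Parameter(torch.randn(num_prior, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prior = self.prior.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([x, prior], dim=1)  # object features + prior slots
        out, _ = self.attn(x, kv, kv)
        return out


class GatedInteractionAttention(nn.Module):
    """Cross-modal attention (question guides image regions) followed by a
    sigmoid gate that suppresses redundant interaction features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, img: torch.Tensor, ques: torch.Tensor) -> torch.Tensor:
        # Image regions attend to question words.
        inter, _ = self.cross(img, ques, ques)
        # Per-dimension gate decides how much interaction signal to keep.
        g = torch.sigmoid(self.gate(torch.cat([img, inter], dim=-1)))
        return g * inter + (1 - g) * img


if __name__ == "__main__":
    img = torch.randn(2, 36, 512)   # 36 region features per image (assumed)
    ques = torch.randn(2, 14, 512)  # 14 word features per question (assumed)
    img = PriorKnowledgeSelfAttention(512)(img)
    fused = GatedInteractionAttention(512)(img, ques)
    print(fused.shape)  # torch.Size([2, 36, 512])
```

In this sketch the prior knowledge is modeled as extra key/value slots and the gate as a learned convex combination of the original and interacted features; the thesis may realize both ideas differently.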
Keywords/Search Tags: Visual Question Answering, Modal Interaction, Multimodal Fusion, Attention Mechanism