With the rapid development of Internet technology, advances in natural language processing and computer vision have driven progress in artificial intelligence, giving rise to Visual Question Answering (VQA). VQA aims to combine vision and language so that computers can understand and answer human questions about images, making it a task of considerable research significance. To improve the answer accuracy of VQA models, this paper introduces attention mechanisms into the VQA model and builds multimodal fine-grained fusion models to predict the correct answer. The specific work and research results are as follows:

First, to address the noise present in the visual features extracted for visual question answering, a Residual Channel Self-attention Network (RCSNet) is proposed. The method uses an improved ResNet to enhance image features and thereby improve the accuracy of image attention; in addition, a new joint attention mechanism is proposed that combines word attention with image region attention to obtain more accurate object relation features. Experimental results show that the method significantly improves the attention extraction ability for image features.

Second, to address the low accuracy of visual question answering models on complex image questions, a Multimodal Chiastopic-Fusion Network (MCFNet) is proposed. The network uses RCSNet to enhance image features and constructs a cross-fusion network that integrates two dynamic information streams to obtain highly correlated multimodal features. Experimental results show that the network reaches 67.57% accuracy on the test-dev split of the VQA v1.0 dataset, 1.20% higher than the CAQT model, verifying the effectiveness and robustness of the MCFNet model.

Finally, to overcome the shortcoming that traditional VQA models cannot fully capture the complex correlations between multimodal features, a Multimodal Transmission Attention Network (MTANet) is proposed. The network recalibrates the input features by fusing features from intermediate layers; the fused features are then focused on fine-grained parts of the image and the question through overlapping computations in the transfer network. Experimental results show that the overall accuracy of MTANet reaches 69.91% and 68.59% on the test-dev splits of the VQA v1.0 and VQA v2.0 datasets, respectively, indicating that MTANet can effectively improve the accuracy of the VQA task.
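To make the residual channel attention idea behind RCSNet concrete: the abstract does not specify the exact architecture, so the module below is a minimal PyTorch sketch assuming a squeeze-and-excitation-style gating with a residual connection over ResNet feature maps; the class name, reduction ratio, and tensor shapes are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Sketch: SE-style channel attention with a residual connection,
    recalibrating CNN feature maps channel by channel (assumed design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze spatial dims to 1x1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                     # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x + x * w                      # residual recalibration

# Usage: refine image features before applying question-guided attention.
feats = torch.randn(2, 2048, 7, 7)            # e.g. ResNet final conv output
refined = ChannelSelfAttention(2048)(feats)
print(refined.shape)                          # torch.Size([2, 2048, 7, 7])
```

The residual addition keeps the original features intact while the learned channel gates suppress noisy channels, which matches the abstract's stated goal of denoising visual features before attention.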