
Research On Visual Question Answering Method Based On Attention Mechanism

Posted on: 2022-03-02    Degree: Master    Type: Thesis
Country: China    Candidate: Z Y Chen    Full Text: PDF
GTID: 2518306350981889    Subject: Master of Engineering
Abstract/Summary:
Visual question answering (VQA) spans the disciplines of computer vision and natural language processing. It studies how to generate an answer, expressed in well-formed natural language, from a given image and a question about that image. A VQA method must understand the image, the question text, and other modal information, and integrate this information effectively. Most traditional VQA methods concentrate on processing a single modality and ignore the interaction between modalities, which leads to low answer-prediction accuracy.

To strengthen the interaction between modalities in VQA and improve answer-prediction accuracy, this thesis proposes an attention model suited to VQA, called the Encoder-Decoder Attention (EDA) model. The EDA model is composed of several basic adaptive self-attention units and adaptive guided-attention units; it strengthens the interaction between modalities, uses text features to guide the generation of image features, and thereby raises the answer-prediction accuracy of VQA methods. To address the EDA model's long training time and slow computation, this thesis further proposes the Stacking Attention (SA) model, which changes how the underlying attention units are connected at the cost of some answer-prediction accuracy; apart from this connection pattern, the SA model shares the EDA model's structure.

In addition, this thesis proposes a VQA method based on the adaptive attention mechanism, called Multimodal Adaptive Attention Networks (MAAN). MAAN uses the object detection network Faster R-CNN to extract image features and a GRU to extract text features. After the two feature streams are processed by either the SA model or the EDA model, they are fused and passed to a classifier to predict the answer. MAAN effectively handles the information interaction between the input image and the question and generates accurate answers in well-formed natural language.

Extensive experiments were conducted on VQA v2.0 and Visual7W. The results show that both the SA model and the EDA model improve the accuracy of the VQA method; the EDA model is more accurate, while the SA model is faster. When compared against other VQA methods, MAAN was evaluated with both the SA model and the EDA model. It achieved high accuracy with either, peaking with the EDA model at 71.45% on VQA v2.0 and 65.3% on Visual7W.
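The self-attention and guided-attention units mentioned above can be illustrated with a minimal sketch. This is not the thesis code: it assumes standard transformer-style attention in PyTorch, and the class names, dimensions, and toy feature shapes (36 detected regions, a 14-word question) are illustrative assumptions only.

```python
# Minimal sketch (assumed implementation, not the author's code) of a
# self-attention unit and a text-guided attention unit for VQA.
import torch
import torch.nn as nn

class SelfAttentionUnit(nn.Module):
    """Self-attention within one modality (e.g., question words or image regions)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.attn(x, x, x)      # query = key = value = x
        return self.norm(x + out)        # residual connection + layer norm

class GuidedAttentionUnit(nn.Module):
    """Text-guided attention: image regions attend to question words."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, img_feats, txt_feats):
        # Queries come from image regions, keys/values from the question,
        # so the question guides which image regions are emphasized.
        out, _ = self.attn(img_feats, txt_feats, txt_feats)
        return self.norm(img_feats + out)

# Toy usage: 36 Faster R-CNN-style region features and a 14-word question,
# both projected to a shared 512-dimensional space.
img = torch.randn(2, 36, 512)
txt = torch.randn(2, 14, 512)
sa, ga = SelfAttentionUnit(), GuidedAttentionUnit()
guided_img = ga(sa(img), sa(txt))        # one encoder-style attention step
print(guided_img.shape)                  # torch.Size([2, 36, 512])
```

In this reading, the EDA and SA models would differ only in how such units are wired together (their connection pattern), which matches the trade-off described above between accuracy and speed.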
Keywords/Search Tags: visual question answering, computer vision, natural language processing, attention mechanism