
Research On Visual Question Answering Method Based On Attention Mechanism

Posted on: 2022-03-02    Degree: Master    Type: Thesis
Country: China    Candidate: Z Y Chen    Full Text: PDF
GTID: 2518306350981889    Subject: Master of Engineering
Abstract/Summary:
Visual question answering (VQA) spans the disciplines of computer vision and natural language processing. It studies how to generate an answer, expressed in well-formed natural language, from a given image and a question about that image. A VQA method must understand the image, the question text, and other modal information, and integrate this information effectively. Most traditional VQA methods concentrate on processing a single modality and ignore the interaction between modalities, which leads to low answer-prediction accuracy.

To strengthen the interaction between modalities in VQA and improve answer-prediction accuracy, this thesis proposes an attention model suited to VQA, called the Encoder-Decoder Attention (EDA) model. The EDA model is composed of several basic adaptive self-attention units and adaptive guided-attention units; it strengthens the interaction between modalities, uses text features to guide the generation of image features, and thereby raises the answer-prediction accuracy of VQA methods. To address the EDA model's long training time and slow computation, this thesis further proposes the Stacking Attention (SA) model, which changes how the underlying attention units are connected at the cost of some answer-prediction accuracy; apart from this connection pattern, the SA model shares the EDA model's structure.

In addition, this thesis proposes a VQA method based on the adaptive attention mechanism, called Multimodal Adaptive Attention Networks (MAAN). MAAN uses the object detection network Faster R-CNN to extract image features and a GRU to extract text features. After the two feature streams are processed by either the SA model or the EDA model, they are fused and passed to a classifier to predict the answer. MAAN effectively handles the information interaction between the input image and the question and generates accurate answers in well-formed natural language.

Extensive experiments were conducted on VQA v2.0 and Visual7W. The results show that both the SA model and the EDA model improve the accuracy of the VQA method; the EDA model is more accurate, while the SA model is faster. When compared against other VQA methods, MAAN was evaluated with both the SA model and the EDA model. It achieved high accuracy with either, peaking with the EDA model at 71.45% on VQA v2.0 and 65.3% on Visual7W.
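The self-attention and guided-attention units mentioned above can be illustrated with a minimal sketch. This is not the thesis code: it assumes standard transformer-style attention in PyTorch, and the class names, dimensions, and toy feature shapes (36 detected regions, a 14-word question) are illustrative assumptions only.

```python
# Minimal sketch (assumed implementation, not the author's code) of a
# self-attention unit and a text-guided attention unit for VQA.
import torch
import torch.nn as nn

class SelfAttentionUnit(nn.Module):
    """Self-attention within one modality (e.g., question words or image regions)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        out, _ = self.attn(x, x, x)      # query = key = value = x
        return self.norm(x + out)        # residual connection + layer norm

class GuidedAttentionUnit(nn.Module):
    """Text-guided attention: image regions attend to question words."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, img_feats, txt_feats):
        # Queries come from image regions, keys/values from the question,
        # so the question guides which image regions are emphasized.
        out, _ = self.attn(img_feats, txt_feats, txt_feats)
        return self.norm(img_feats + out)

# Toy usage: 36 Faster R-CNN-style region features and a 14-word question,
# both projected to a shared 512-dimensional space.
img = torch.randn(2, 36, 512)
txt = torch.randn(2, 14, 512)
sa, ga = SelfAttentionUnit(), GuidedAttentionUnit()
guided_img = ga(sa(img), sa(txt))        # one encoder-style attention step
print(guided_img.shape)                  # torch.Size([2, 36, 512])
```

In this reading, the EDA and SA models would differ only in how such units are wired together (their connection pattern), which matches the trade-off described above between accuracy and speed.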
Keywords/Search Tags: visual question answering, computer vision, natural language processing, attention mechanism