
Research On Visual Question Answering Based On Deep Learning

Posted on: 2021-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: L H Yu
Full Text: PDF
GTID: 2428330605950550
Subject: Control Science and Engineering
Abstract/Summary:
Visual Question Answering (VQA) is a task that combines Natural Language Processing (NLP) with Computer Vision (CV) and aims to generate a reasonable answer for a given image and question. The key challenge is that VQA requires a full understanding of both the visual information in the image and the semantic information in the question. Therefore, this paper conducts in-depth research on the key issues of VQA, building on previous work. The specific contents are as follows.

Because existing VQA research does not use attention mechanisms effectively and ignores the hidden relationships among image-question-answer triples, this paper proposes a VQA method based on a multi-layer attention mechanism. Faster R-CNN and a Gated Recurrent Unit (GRU) are used to extract image visual features and question semantic features, respectively, and a Transformer model with a multi-layer attention mechanism realizes the multimodal information interaction. From this we obtain the semantic information of the answer and the fused feature of the image-question pair, and then predict the answer from the acquired features (a minimal sketch of this fusion pipeline is given below). Experimental results show that the proposed method achieves 70.63% overall accuracy and significantly outperforms the previous state of the art.

Because existing VQA methods do not consider high-level semantic information in images, especially the interactions among objects or parts of an image, we further propose a novel VQA method that jointly models the co-occurrence relationships and the high-level semantic relationships among objects in an image, generating high-level graph visual representations that capture question-specific interactions. Our method has two main parts: a question-guided object/part-level image feature extractor that generates co-occurrence relationships in the image, and a visual relationship detector that extracts semantic relationships in the image. A graph convolutional network (GCN) is then applied to generate graph representations, which are fed into a traditional VQA module to predict answers (a sketch of the graph module is also given below). Experiments on the VQA-2.0 dataset achieve an accuracy of 67.3%, demonstrating that our jointly modeled co-occurrence and visual-semantic relationship graph helps capture question-specific interactions and performs better than previous graph-based methods.

Finally, we summarize the work of this paper and introduce future work.
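The following is a minimal, hypothetical PyTorch sketch of the first method's fusion pipeline: pre-extracted Faster R-CNN region features and a GRU question encoding are fused by stacked multi-head attention layers, then an answer classifier is applied. The module names, feature dimensions (2048-d regions, 512-d hidden states), layer counts, and the answer-vocabulary size are illustrative assumptions, not the thesis's actual configuration.

```python
# Hypothetical sketch, not the thesis implementation: Faster R-CNN region
# features + GRU question encoding, fused by Transformer-style multi-layer
# attention, followed by an answer classifier. Dimensions are assumptions.
import torch
import torch.nn as nn

class AttentionFusionVQA(nn.Module):
    def __init__(self, region_dim=2048, word_dim=300, hidden_dim=512,
                 num_heads=8, num_layers=2, num_answers=3129, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, hidden_dim, batch_first=True)   # question encoder
        self.proj_v = nn.Linear(region_dim, hidden_dim)             # project region features
        # stacked multi-head attention layers realize the multi-layer attention fusion
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, region_feats, question_ids):
        # region_feats: (B, K, region_dim) from Faster R-CNN; question_ids: (B, T)
        _, q_state = self.gru(self.embed(question_ids))              # (1, B, hidden_dim)
        q = q_state[-1].unsqueeze(1)                                 # (B, 1, hidden_dim)
        v = self.proj_v(region_feats)                                # (B, K, hidden_dim)
        fused = self.fusion(torch.cat([q, v], dim=1))                # joint attention over both modalities
        return self.classifier(fused[:, 0])                         # answer logits from the question slot

# toy usage with random tensors
model = AttentionFusionVQA()
logits = model(torch.randn(2, 36, 2048), torch.randint(0, 10000, (2, 14)))
print(logits.shape)  # torch.Size([2, 3129])
```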
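Likewise, a minimal sketch of the second method's graph component, assuming the co-occurrence and visual-semantic relations have already been combined into an adjacency matrix over detected regions; the two-layer GCN, mean pooling, question-vector dimension, and answer-vocabulary size are assumptions rather than the thesis's actual design.

```python
# Hypothetical sketch: a simple two-layer GCN over object regions whose edges
# encode assumed co-occurrence / visual-semantic relations, pooled and combined
# with a question vector for answer prediction. All dimensions are assumptions.
import torch
import torch.nn as nn

class GraphVQA(nn.Module):
    def __init__(self, region_dim=2048, q_dim=512, hidden_dim=512, num_answers=3129):
        super().__init__()
        self.w1 = nn.Linear(region_dim, hidden_dim)    # first GCN layer weights
        self.w2 = nn.Linear(hidden_dim, hidden_dim)    # second GCN layer weights
        self.classifier = nn.Linear(hidden_dim + q_dim, num_answers)

    @staticmethod
    def normalize(adj):
        # symmetric normalization D^(-1/2) (A + I) D^(-1/2)
        adj = adj + torch.eye(adj.size(-1), device=adj.device)
        d_inv_sqrt = adj.sum(-1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(-1) * adj * d_inv_sqrt.unsqueeze(-2)

    def forward(self, region_feats, adj, q_feat):
        # region_feats: (B, K, region_dim); adj: (B, K, K) relation graph; q_feat: (B, q_dim)
        a = self.normalize(adj)
        h = torch.relu(a @ self.w1(region_feats))      # first graph convolution
        h = torch.relu(a @ self.w2(h))                 # second graph convolution
        pooled = h.mean(dim=1)                         # graph-level visual representation
        return self.classifier(torch.cat([pooled, q_feat], dim=-1))

# toy usage: 36 regions, a random symmetric relation graph, a random question vector
adj = (torch.rand(2, 36, 36) > 0.8).float()
adj = ((adj + adj.transpose(1, 2)) > 0).float()
logits = GraphVQA()(torch.randn(2, 36, 2048), adj, torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 3129])
```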
Keywords/Search Tags:Visual Question Answering, Multi-Modal Fusion, Multi-layer Attention, High-level Semantics, Graph Convolutional Network