
Visual Question Answering Based On Deep Reasoning

Posted on: 2021-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: F Liu
Full Text: PDF
GTID: 2428330611465692
Subject: Software engineering
Abstract/Summary:
Visual question answering (VQA) is a challenging task that requires a joint semantic understanding of multimodal inputs (i.e., image content and natural language questions) and joint reasoning over vision and language. Existing VQA models usually combine convolutional neural networks and recurrent neural networks to map images and questions into a common feature space. Since scene text often conveys key information, the machine must also understand the text in the image. Unlike general VQA, text-based VQA (T-VQA) requires "reading" the text in the image and reasoning over it together with the other visual content.

For T-VQA, this thesis first proposes a two-stage reasoning model. In the first stage (Stage I), we answer the question using only the text recognized in the image. If the Stage I prediction is not confident enough, we turn to the second reasoning stage, which uses both the visual and the text features of the image. To establish the relationship between the visual objects and the texts in the image, we design a cross-modal relationship graph, and we further explore the interplay between the question representation and the attention mechanism. Experimental results demonstrate that our model effectively extracts and merges the high-level semantic features of the visual and textual content in images, improving performance on the TextVQA and ST-VQA datasets.

Existing VQA methods treat answer prediction as a single-step classification problem, i.e., selecting an answer from a fixed answer space, so complex answers cannot be generated. To address this issue, this thesis proposes to treat answer prediction as a text generation task and to iteratively predict multi-word answers with a Transformer-based model. In addition, since model performance depends on the accuracy of text recognition, we introduce an auxiliary task trained with policy gradient to reduce the dependence on text recognition and thereby strengthen the reasoning ability. Experiments show that the Transformer-based generative VQA model proposed in this thesis significantly outperforms state-of-the-art methods on multiple T-VQA datasets.
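To make the confidence-gated, two-stage inference described above concrete, here is a minimal sketch in PyTorch; the function names and the threshold value are hypothetical stand-ins, not the thesis's published implementation.

import torch

def two_stage_answer(stage1_logits, stage2_fn, threshold=0.5):
    """Answer with Stage I (text-only) unless its confidence is too low.

    stage1_logits: (num_answers,) scores from the text-only first stage.
    stage2_fn:     callable running the full visual+text reasoning stage.
    threshold:     hypothetical confidence cutoff for falling back.
    """
    probs = torch.softmax(stage1_logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= threshold:
        return prediction.item()   # Stage I is confident enough
    return stage2_fn()             # fall back to joint Stage II reasoning

# Toy usage: near-uniform scores give low confidence, so Stage II runs.
logits = torch.tensor([0.10, 0.20, 0.15])
print(two_stage_answer(logits, stage2_fn=lambda: 2))  # prints 2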
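The cross-modal relationship graph can be pictured as attention over a graph whose nodes mix visual-region and OCR-token features. The following is a single-head, GAT-style layer sketched under that assumption; the thesis's exact graph construction and edge types are not reproduced here.

import torch
import torch.nn as nn

class CrossModalGraphLayer(nn.Module):
    """One GAT-style message-passing step over visual + OCR nodes (a sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)    # shared node projection
        self.att = nn.Linear(2 * dim, 1)   # pairwise attention scorer

    def forward(self, nodes, adj):
        # nodes: (N, dim) stacked visual-region and OCR-token features
        # adj:   (N, N) 0/1 adjacency; include self-loops so no row of the
        #        softmax below is entirely -inf
        h = self.proj(nodes)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.att(pairs).squeeze(-1)           # (N, N) edge scores
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)          # attention over neighbors
        return torch.relu(alpha @ h)                   # aggregated node features

# Toy usage: 5 visual nodes + 3 OCR nodes, fully connected graph.
feats = torch.randn(8, 64)
adj = torch.ones(8, 8)
print(CrossModalGraphLayer(64)(feats, adj).shape)  # torch.Size([8, 64])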
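Iterative multi-word answer generation can be sketched as a greedy decoding loop whose per-step output space is a fixed vocabulary plus the dynamic list of OCR tokens from the image, so recognized scene text can be copied into the answer. The step_fn below is a hypothetical stand-in for the Transformer decoder step, not the thesis's actual model.

import torch

def greedy_decode(step_fn, vocab, ocr_tokens, max_len=12, eos="<eos>"):
    """Greedily build a multi-word answer, one token per step.

    step_fn returns logits of size len(vocab) + len(ocr_tokens):
    indices past the fixed vocabulary point into the image's OCR tokens.
    """
    answer = []
    for _ in range(max_len):
        logits = step_fn(answer)           # hypothetical decoder step
        idx = int(torch.argmax(logits))
        token = vocab[idx] if idx < len(vocab) else ocr_tokens[idx - len(vocab)]
        if token == eos:
            break
        answer.append(token)
    return " ".join(answer)

# Toy usage with a scripted step function that emits "stop", then <eos>.
vocab = ["yes", "no", "stop", "<eos>"]
ocr = ["COCA-COLA"]
script = iter([2, 3])
print(greedy_decode(lambda ans: torch.eye(5)[next(script)], vocab, ocr))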
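The policy-gradient training of the auxiliary task can be illustrated with a plain REINFORCE loss: sample an answer, score it with a task reward, and weight the sampled tokens' log-probabilities by the reward advantage. The reward and baseline here are hypothetical placeholders for whatever signal the auxiliary task actually provides.

import torch

def reinforce_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE: scale sampled tokens' log-probs by the advantage.

    token_log_probs: (T,) log-probabilities of the sampled answer tokens.
    reward:          scalar task reward (e.g., 1.0 if the answer is correct).
    baseline:        variance-reduction baseline (a constant in this sketch).
    """
    return -(reward - baseline) * token_log_probs.sum()

# Toy usage: a 3-token sampled answer that earned full reward.
log_probs = torch.log(torch.tensor([0.5, 0.4, 0.9], requires_grad=True))
loss = reinforce_loss(log_probs, reward=1.0, baseline=0.2)
loss.backward()   # in practice, gradients flow back into the policy network
print(loss.item())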
Keywords/Search Tags: Visual Question Answering, Deep Reasoning, Graph Neural Networks, Attention Mechanism, Transformer