
Visual Question Answering Based On Deep Reasoning

Posted on: 2021-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: F Liu
Full Text: PDF
GTID: 2428330611465692
Subject: Software engineering
Abstract/Summary:
Visual question answering (VQA) is a challenging task that requires a joint semantic understanding of multimodal inputs (i.e., image content and natural language questions) and joint reasoning over vision and language. Existing VQA models usually combine convolutional neural networks and recurrent neural networks to map images and questions into a common feature space. Since scene text often conveys key information, the machine must also understand the text in the image. Unlike general VQA, text-based VQA (T-VQA) requires "reading" the text in the image and reasoning over it together with the other visual content.

For T-VQA, this thesis first proposes a two-stage reasoning model. In the first stage (Stage I), we answer the question using only the text recognized in the image. If the Stage I prediction is not confident enough, we turn to the second reasoning stage, which uses both the visual and the text features of the image. To establish the relationship between the visual objects and the texts in the image, we design a cross-modal relationship graph, and we further explore the interplay between the question representation and the attention mechanism. Experimental results demonstrate that our model effectively extracts and merges the high-level semantic features of the visual and textual content in images, improving performance on the TextVQA and ST-VQA datasets.

Existing VQA methods treat answer prediction as a single-step classification problem, i.e., selecting an answer from a fixed answer space, so complex answers cannot be generated. To address this issue, this thesis proposes to treat answer prediction as a text generation task and to iteratively predict multi-word answers with a Transformer-based model. In addition, since model performance depends on the accuracy of text recognition, we introduce an auxiliary task trained with policy gradient to reduce the dependence on text recognition and thereby strengthen the reasoning ability. Experiments show that the Transformer-based generative VQA model proposed in this thesis significantly outperforms state-of-the-art methods on multiple T-VQA datasets.
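To make the confidence-gated, two-stage inference described above concrete, here is a minimal sketch in PyTorch; the function names and the threshold value are hypothetical stand-ins, not the thesis's published implementation.

import torch

def two_stage_answer(stage1_logits, stage2_fn, threshold=0.5):
    """Answer with Stage I (text-only) unless its confidence is too low.

    stage1_logits: (num_answers,) scores from the text-only first stage.
    stage2_fn:     callable running the full visual+text reasoning stage.
    threshold:     hypothetical confidence cutoff for falling back.
    """
    probs = torch.softmax(stage1_logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= threshold:
        return prediction.item()   # Stage I is confident enough
    return stage2_fn()             # fall back to joint Stage II reasoning

# Toy usage: near-uniform scores give low confidence, so Stage II runs.
logits = torch.tensor([0.10, 0.20, 0.15])
print(two_stage_answer(logits, stage2_fn=lambda: 2))  # prints 2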
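The cross-modal relationship graph can be pictured as attention over a graph whose nodes mix visual-region and OCR-token features. The following is a single-head, GAT-style layer sketched under that assumption; the thesis's exact graph construction and edge types are not reproduced here.

import torch
import torch.nn as nn

class CrossModalGraphLayer(nn.Module):
    """One GAT-style message-passing step over visual + OCR nodes (a sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)    # shared node projection
        self.att = nn.Linear(2 * dim, 1)   # pairwise attention scorer

    def forward(self, nodes, adj):
        # nodes: (N, dim) stacked visual-region and OCR-token features
        # adj:   (N, N) 0/1 adjacency; include self-loops so no row of the
        #        softmax below is entirely -inf
        h = self.proj(nodes)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.att(pairs).squeeze(-1)           # (N, N) edge scores
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)          # attention over neighbors
        return torch.relu(alpha @ h)                   # aggregated node features

# Toy usage: 5 visual nodes + 3 OCR nodes, fully connected graph.
feats = torch.randn(8, 64)
adj = torch.ones(8, 8)
print(CrossModalGraphLayer(64)(feats, adj).shape)  # torch.Size([8, 64])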
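Iterative multi-word answer generation can be sketched as a greedy decoding loop whose per-step output space is a fixed vocabulary plus the dynamic list of OCR tokens from the image, so recognized scene text can be copied into the answer. The step_fn below is a hypothetical stand-in for the Transformer decoder step, not the thesis's actual model.

import torch

def greedy_decode(step_fn, vocab, ocr_tokens, max_len=12, eos="<eos>"):
    """Greedily build a multi-word answer, one token per step.

    step_fn returns logits of size len(vocab) + len(ocr_tokens):
    indices past the fixed vocabulary point into the image's OCR tokens.
    """
    answer = []
    for _ in range(max_len):
        logits = step_fn(answer)           # hypothetical decoder step
        idx = int(torch.argmax(logits))
        token = vocab[idx] if idx < len(vocab) else ocr_tokens[idx - len(vocab)]
        if token == eos:
            break
        answer.append(token)
    return " ".join(answer)

# Toy usage with a scripted step function that emits "stop", then <eos>.
vocab = ["yes", "no", "stop", "<eos>"]
ocr = ["COCA-COLA"]
script = iter([2, 3])
print(greedy_decode(lambda ans: torch.eye(5)[next(script)], vocab, ocr))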
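The policy-gradient training of the auxiliary task can be illustrated with a plain REINFORCE loss: sample an answer, score it with a task reward, and weight the sampled tokens' log-probabilities by the reward advantage. The reward and baseline here are hypothetical placeholders for whatever signal the auxiliary task actually provides.

import torch

def reinforce_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE: scale sampled tokens' log-probs by the advantage.

    token_log_probs: (T,) log-probabilities of the sampled answer tokens.
    reward:          scalar task reward (e.g., 1.0 if the answer is correct).
    baseline:        variance-reduction baseline (a constant in this sketch).
    """
    return -(reward - baseline) * token_log_probs.sum()

# Toy usage: a 3-token sampled answer that earned full reward.
log_probs = torch.log(torch.tensor([0.5, 0.4, 0.9], requires_grad=True))
loss = reinforce_loss(log_probs, reward=1.0, baseline=0.2)
loss.backward()   # in practice, gradients flow back into the policy network
print(loss.item())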
Keywords/Search Tags: Visual Question Answering, Deep Reasoning, Graph Neural Networks, Attention Mechanism, Transformer