
Research On Visual Question Answering Method Based On Deep Learning

Posted on: 2021-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: M Q Jiang
Full Text: PDF
GTID: 2518306122468664
Subject: Computer Science and Technology
Abstract/Summary:
With the accumulation of multimodal data and the rapid development of deep learning, cross-modal learning tasks represented by visual question answering (VQA) have received extensive attention and research. In visual question answering, given an image and a question in natural language, a model must reason over the visual elements of the image to infer the correct answer. VQA is a challenging multimodal learning task, since it requires understanding the textual and visual modalities simultaneously. The approaches used to represent the questions and images in a fine-grained manner therefore play a key role in performance. To obtain fine-grained representations, this thesis designs end-to-end deep neural network models based on the attention mechanism to jointly learn question and image features. The main work of this thesis includes:

1. To address the problem that the traditional co-attention mechanism cannot accurately locate the important words in the question and the related visual regions in the image, this thesis proposes the CAQT model. CAQT contains a co-attention mechanism that combines textual attention based on self-attention with question-guided visual attention. The self-attention-based textual attention finds the important words in the question and obtains a discriminative question representation; the resulting question feature then guides the visual attention computation, allowing the mechanism to locate image regions related to the question. In addition, this thesis introduces the question type into the CAQT model, dividing the questions in the VQA v1.0 and VQA v2.0 datasets into 8 categories. The question type is incorporated by directly concatenating its one-hot encoding with the multimodal joint representation, so the model knows the question type before answer prediction, which narrows the search range of the answer and thus improves performance. (A code sketch of this design follows the abstract.)

2. Since the features computed by an attention module may not be related to the query involved in the computation, this thesis proposes the double attention (DAtt) mechanism. DAtt's attention module consists of two parts: textual double attention and visual double attention. The double attention mechanism ensures that the features obtained by the attention computation remain related to the query, and focuses on input information relevant to the semantics of the question, thereby reducing the interference of irrelevant information. (A second sketch below illustrates one reading of this idea.)

3. All the methods proposed in this thesis are verified on the two benchmark datasets VQA v1.0 and VQA v2.0. The co-attention mechanism and the question-type module in the CAQT model improve answer accuracy, and the textual and visual double attention in the DAtt model likewise improve model performance.
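To make point 1 concrete, the following is a minimal PyTorch sketch of a CAQT-style pipeline: self-attention over question words, question-guided attention over image regions, and a question-type one-hot vector concatenated with the fused representation before answer prediction. All dimensions, module names, and the element-wise-product fusion are illustrative assumptions, not the thesis's actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CoAttentionWithQuestionType(nn.Module):
        def __init__(self, q_dim=512, v_dim=2048, joint_dim=1024, n_types=8):
            super().__init__()
            self.q_score = nn.Linear(q_dim, 1)          # textual self-attention scores
            self.v_score = nn.Linear(q_dim + v_dim, 1)  # question-guided visual scores
            self.q_proj = nn.Linear(q_dim, joint_dim)
            self.v_proj = nn.Linear(v_dim, joint_dim)
            self.n_types = n_types

        def forward(self, q_words, v_regions, q_type):
            # q_words:   (B, T, q_dim) word features, e.g. from an LSTM
            # v_regions: (B, K, v_dim) region features, e.g. from Faster R-CNN
            # q_type:    (B,) integer question-type labels in [0, n_types)
            a_q = F.softmax(self.q_score(q_words), dim=1)        # (B, T, 1)
            q_feat = (a_q * q_words).sum(dim=1)                  # attended question

            # Tile the attended question and use it to guide attention over regions.
            q_tiled = q_feat.unsqueeze(1).expand(-1, v_regions.size(1), -1)
            a_v = F.softmax(self.v_score(torch.cat([q_tiled, v_regions], dim=-1)), dim=1)
            v_feat = (a_v * v_regions).sum(dim=1)                # attended image

            # Fuse the modalities and append the one-hot question type so the
            # answer classifier sees the question category directly.
            joint = self.q_proj(q_feat) * self.v_proj(v_feat)    # (B, joint_dim)
            type_onehot = F.one_hot(q_type, self.n_types).float()
            return torch.cat([joint, type_onehot], dim=-1)       # (B, joint_dim + n_types)

    # Example: model = CoAttentionWithQuestionType()
    #          out = model(torch.randn(2, 14, 512), torch.randn(2, 36, 2048),
    #                      torch.tensor([3, 7]))  # out.shape == (2, 1032)

The fused vector would then feed a standard answer classifier; concatenating the 8-way type one-hot is the inexpensive "tell the model the question category before prediction" step the abstract describes.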
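Point 2 does not spell out how the double attention is computed, so the sketch below shows one plausible reading, labeled as an assumption: a first attention pass produces an attended feature, and that feature then serves as the query for a second pass over the same inputs, re-checking that the final feature stays related to the original query.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DoubleAttention(nn.Module):
        # Hypothetical DAtt-style module; the thesis's exact formulation may differ.
        def __init__(self, query_dim, feat_dim, hidden=512):
            super().__init__()
            self.q1 = nn.Linear(query_dim, hidden)  # projects the original query
            self.q2 = nn.Linear(feat_dim, hidden)   # projects the first attended result
            self.k = nn.Linear(feat_dim, hidden)    # projects the inputs as keys

        def attend(self, q, keys, values):
            # q: (B, hidden); keys: (B, N, hidden); values: (B, N, feat_dim)
            scores = torch.bmm(keys, q.unsqueeze(2))   # (B, N, 1)
            alpha = F.softmax(scores, dim=1)
            return (alpha * values).sum(dim=1)         # (B, feat_dim)

        def forward(self, query, feats):
            # query: (B, query_dim), e.g. the question feature
            # feats: (B, N, feat_dim), word or region features
            keys = torch.tanh(self.k(feats))
            first = self.attend(torch.tanh(self.q1(query)), keys, feats)
            # Second pass: the first attended result becomes the query, which
            # suppresses inputs that only weakly match the original query.
            return self.attend(torch.tanh(self.q2(first)), keys, feats)

Applied on the textual side, feats would be word features; on the visual side, region features, matching the textual/visual double attention split described above.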
Keywords/Search Tags: Visual question answering, co-attention, double attention, self-attention, question type