
Research On Visual Question Answering Model Based On Attention Mechanism

Posted on: 2021-07-21    Degree: Master    Type: Thesis
Country: China    Candidate: Y T Wu    Full Text: PDF
GTID: 2518306107450264    Subject: Computer technology
Abstract/Summary:
In the past decade, computer vision and natural language processing have made great progress, and the development of these two fields has driven research on multi-modal tasks. One representative task is Visual Question Answering (VQA), an open-ended task proposed by the academic community in 2015: given an image and a natural language question about the image, the task is to provide a natural language answer. A visual question answering model requires a fine-grained understanding of both the visual content of the image and the textual content of the question. However, most current VQA models use only visual attention and ignore textual attention. Moreover, when learning visual attention, most models first learn the attention within a single modality and then learn the attention of the other modality over it; we believe this approach loses information that is unimportant within one modality yet critical to the other. In addition, most models use only shallow attention networks, and shallow models cannot capture the high-level connections between images and text. To address these problems, this thesis proposes an improved maximum value-based attention network, referred to as MSGA. It uses two basic attention units, a self-attention unit and a guided attention unit, to learn the information flow within a modality and the inter-modal information flowing from the other modality, and then applies the maximum-value idea to take the larger of the intra-modal and inter-modal attention features. Based on this network, three deep cascaded networks are proposed: two use stacking and the third uses an encoder-decoder structure. The difference between the two stacking variants is whether MSGA is also used to extract the attention features of the question. All three models can learn intra- and inter-modality information flow, thus significantly improving visual question answering performance. The models are evaluated on the VQA-v2 dataset. The experimental results show that the worst single-model result in this thesis is 0.32 higher than the 2019 DFAF model [2], and the best single-model result is 0.23 higher than the single model [3] proposed by the 2019 VQA Challenge champion, reaching 71.13.
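As a minimal illustration of the maximum value-based fusion described above, the sketch below assumes a PyTorch implementation; the class name MaxFusionAttention, the head count, and the feature dimensions are hypothetical and not taken from the thesis. It combines a self-attention unit over image regions with a guided-attention unit driven by the question, then fuses the two outputs with an element-wise maximum.

# Hypothetical sketch of the maximum value-based attention idea:
# intra-modal (self) attention and inter-modal (guided) attention are
# computed separately, then merged by taking the element-wise maximum.
import torch
import torch.nn as nn

class MaxFusionAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # intra-modal flow: image regions attend to each other
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # inter-modal flow: image regions attend to question words
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats, ques_feats):
        # img_feats:  (batch, num_regions, dim) image region features
        # ques_feats: (batch, num_words, dim)   question word features
        intra, _ = self.self_attn(img_feats, img_feats, img_feats)
        inter, _ = self.guided_attn(img_feats, ques_feats, ques_feats)
        # keep the stronger response from either attention flow
        fused = torch.maximum(intra, inter)
        return self.norm(fused + img_feats)

# usage sketch with assumed sizes (36 regions, 14 question tokens)
if __name__ == "__main__":
    layer = MaxFusionAttention()
    img = torch.randn(2, 36, 512)
    ques = torch.randn(2, 14, 512)
    out = layer(img, ques)  # shape (2, 36, 512)

Cascading several such layers, either by stacking or inside an encoder-decoder arrangement as the abstract describes, would give the deeper networks the thesis evaluates.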
Keywords/Search Tags: VQA, maximum value-based attention network, stacking, encoder-decoder