
Research On Visual Question Answering Model Based On Attention Mechanism

Posted on: 2021-07-21    Degree: Master    Type: Thesis
Country: China    Candidate: Y T Wu    Full Text: PDF
GTID: 2518306107450264    Subject: Computer technology
Abstract/Summary:
In the past decade, computer vision and natural language processing have made great progress, and the development of these two fields has driven research on multi-modal tasks. One representative task is Visual Question Answering (VQA), an open-ended task proposed by the academic community in 2015: given an image and a natural language question about the image, the task is to provide a natural language answer. A visual question answering model requires a fine-grained understanding of both the visual content of the image and the textual content of the question. However, most current VQA models use only visual attention and ignore textual attention. Moreover, when learning visual attention, most models first learn the attention within a single modality and then learn the attention of the other modality over it; we believe this approach loses information that is unimportant within one modality yet critical to the other. In addition, most models use only shallow attention networks, and shallow models cannot capture the high-level connections between images and text. To address these problems, this thesis proposes an improved maximum value-based attention network, referred to as MSGA. It uses two basic attention units, a self-attention unit and a guided attention unit, to learn the information flow within a modality and the inter-modal information flowing from the other modality, and then applies the maximum-value idea to take the larger of the intra-modal and inter-modal attention features. Based on this network, three deep cascaded networks are proposed: two use stacking and the third uses an encoder-decoder structure. The difference between the two stacking variants is whether MSGA is also used to extract the attention features of the question. All three models can learn intra- and inter-modality information flow, thus significantly improving visual question answering performance. The models are evaluated on the VQA-v2 dataset. The experimental results show that the worst single-model result in this thesis is 0.32 higher than the 2019 DFAF model [2], and the best single-model result is 0.23 higher than the single model [3] proposed by the 2019 VQA Challenge champion, reaching 71.13.
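As a minimal illustration of the maximum value-based fusion described above, the sketch below assumes a PyTorch implementation; the class name MaxFusionAttention, the head count, and the feature dimensions are hypothetical and not taken from the thesis. It combines a self-attention unit over image regions with a guided-attention unit driven by the question, then fuses the two outputs with an element-wise maximum.

# Hypothetical sketch of the maximum value-based attention idea:
# intra-modal (self) attention and inter-modal (guided) attention are
# computed separately, then merged by taking the element-wise maximum.
import torch
import torch.nn as nn

class MaxFusionAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # intra-modal flow: image regions attend to each other
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # inter-modal flow: image regions attend to question words
        self.guided_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feats, ques_feats):
        # img_feats:  (batch, num_regions, dim) image region features
        # ques_feats: (batch, num_words, dim)   question word features
        intra, _ = self.self_attn(img_feats, img_feats, img_feats)
        inter, _ = self.guided_attn(img_feats, ques_feats, ques_feats)
        # keep the stronger response from either attention flow
        fused = torch.maximum(intra, inter)
        return self.norm(fused + img_feats)

# usage sketch with assumed sizes (36 regions, 14 question tokens)
if __name__ == "__main__":
    layer = MaxFusionAttention()
    img = torch.randn(2, 36, 512)
    ques = torch.randn(2, 14, 512)
    out = layer(img, ques)  # shape (2, 36, 512)

Cascading several such layers, either by stacking or inside an encoder-decoder arrangement as the abstract describes, would give the deeper networks the thesis evaluates.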
Keywords/Search Tags: VQA, maximum value-based attention network, stacking, encoder-decoder