
Research On Collaborative Attention Models And Deep Correlated Networks For Visual Question Answering

Posted on: 2021-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: L Li
Full Text: PDF
GTID: 2428330611462515
Subject: Computer technology
Abstract/Summary:
In recent years, the multi-modal task of visual question answering, which combines computer vision and natural language processing, has attracted increasing attention from researchers. Unlike the cross-modal task of image captioning, which simply describes the main content of an image in a single sentence, visual question answering requires a machine to automatically answer natural-language questions about an input image. The task involves understanding multi-modal content: the model must extract and analyze both image and question data to infer the correct answer, which places higher demands on its fine-grained understanding of the image. The key to visual question answering is a shared semantic understanding of visual images and natural language, together with joint guidance and joint reasoning between vision and semantics, and the attention mechanism is an effective way to achieve this multi-modal association. However, existing visual question answering methods still have many problems. Motivated by these problems, this thesis further explores the attention mechanism in visual question answering and improves on existing attention networks. The main work and innovations of this thesis are summarized as follows:

(1) A multi-view attention network for visual question answering. This thesis proposes a visual question answering model based on a multi-view attention mechanism. Visual question answering involves multiple semantic and visual expressions; in particular, some questions require the model to understand the semantic relations among multiple target objects in an image, and a single visual attention model cannot effectively mine the associations between the different semantic objects in the image and the semantics of the question. The proposed multi-view attention network filters image information from different perspectives and effectively focuses on all the image regions that require attention. The model uses different attention modules in its upper and lower layers to jointly compute image weights and perform joint weighting. It achieves good results on the public VQA v2.0 dataset.

(2) A self-correlated and interaction-guided attention mechanism for efficient visual question answering. This thesis proposes a visual question answering model that combines self-correlated and interaction-guided attention mechanisms. The model first establishes self-correlated attention modules within the question modality and the visual-image modality, and then builds "question-image" and "image-question" interaction-guided attention modules through the semantic guidance of data between the modalities. This effectively enhances the high-level semantic interaction between visual image information and textual question information, improving the model's overall generalization ability and optimizing the flow of information between the modalities. Experimental results and ablation analysis show that the proposed model predicts visual question answering results more accurately and has good robustness and scalability.
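The abstract does not give implementation details, but the multi-view attention idea in (1) — several attention "views" over image region features, each guided by the question encoding and then jointly weighted — can be sketched roughly as below. This is a minimal PyTorch illustration; the class name, parameter names, dimensions, and the question-conditioned gate over views are all assumptions for illustration, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewAttention(nn.Module):
    """Hypothetical sketch: attend to image regions from several 'views',
    then jointly weight the attended summaries."""

    def __init__(self, img_dim=2048, q_dim=1024, hidden=512, num_views=2):
        super().__init__()
        # One additive-attention scorer per view: score = w^T tanh(W_v v + W_q q)
        self.img_proj = nn.ModuleList([nn.Linear(img_dim, hidden) for _ in range(num_views)])
        self.q_proj = nn.ModuleList([nn.Linear(q_dim, hidden) for _ in range(num_views)])
        self.score = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_views)])
        # Assumed joint weighting: a question-conditioned gate over the views
        self.view_gate = nn.Linear(q_dim, num_views)

    def forward(self, img_feats, q_feat):
        # img_feats: (B, R, img_dim) region features; q_feat: (B, q_dim)
        summaries = []
        for wv, wq, ws in zip(self.img_proj, self.q_proj, self.score):
            h = torch.tanh(wv(img_feats) + wq(q_feat).unsqueeze(1))  # (B, R, hidden)
            alpha = F.softmax(ws(h), dim=1)                          # weights over regions
            summaries.append((alpha * img_feats).sum(dim=1))         # (B, img_dim)
        views = torch.stack(summaries, dim=1)                        # (B, V, img_dim)
        gate = F.softmax(self.view_gate(q_feat), dim=-1).unsqueeze(-1)
        return (gate * views).sum(dim=1)                             # fused image summary
```

Each view can learn to emphasize a different subset of regions, which is one plausible reading of "filtering image information from different perspectives".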
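Similarly, the pairing in (2) of self-correlated (intra-modality) attention with "question-image" and "image-question" interaction-guided attention resembles a self-attention-then-cross-attention stack. The sketch below uses nn.MultiheadAttention for both roles; this choice, and all names and shapes, are illustrative assumptions rather than the thesis's exact design.

```python
import torch.nn as nn

class SelfThenGuidedAttention(nn.Module):
    """Hypothetical sketch: self-attention inside each modality, then
    cross-modal guided attention in both directions."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.q_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # question guides image
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)  # image guides question

    def forward(self, q_tokens, v_regions):
        # q_tokens: (B, Tq, dim) question word features
        # v_regions: (B, Tv, dim) image region features
        q, _ = self.q_self(q_tokens, q_tokens, q_tokens)     # intra-question correlation
        v, _ = self.v_self(v_regions, v_regions, v_regions)  # intra-image correlation
        # "question-image": image features re-weighted under question guidance
        v_guided, _ = self.q2v(v, q, q)
        # "image-question": question features re-weighted under image guidance
        q_guided, _ = self.v2q(q, v, v)
        return q_guided, v_guided
```

The two guided branches are what would carry the "high-level semantic interaction" between modalities that the abstract describes; a real model would add residual connections, normalization, and a fusion head on top.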
Keywords/Search Tags:Visual Question Answering, Attention Mechanism, Deep Learning, Natural Language Processing, Question Answering System