
Research On Visual Question Answering Based On Attention Mechanism And Image Global Feature Injection

Posted on: 2024-06-26  Degree: Master  Type: Thesis
Country: China  Candidate: Y B Lou  Full Text: PDF
GTID: 2568307115998879  Subject: Electronic Information (Computer Technology) (Professional Degree)
Abstract/Summary:
Visual Question Answering (VQA), as a high-level visual task, aims to accurately answer natural language questions about a given image. It has broad application prospects in fields such as medical rescue, human-machine interaction, intelligent customer service, and search engines. As an important research topic in artificial intelligence, VQA involves both computer vision and natural language processing: it requires not only visual reasoning over image content and fine-grained semantic understanding of the text, but also a deep understanding of the relationship between the two modalities in order to predict the correct answer. VQA is therefore a highly challenging task. Building on existing research, this thesis carries out the following work:

(1) To address the poor noise-filtering ability of the self-attention mechanism in existing visual question answering models, a VQA model based on a multimodal gate self-attention (MGSA) mechanism is proposed. In this model, features from the other modality act as a channel gate inside the self-attention module, filtering the output of the target modality's self-attention learning. Meanwhile, a cross-modal bidirectional attention mechanism and stacked attention modules are combined to jointly learn co-attention and deep attention. Finally, the rich attention results over image and question features are fused, and the prediction is obtained through a classification network. Through the gate self-attention mechanism, the cross-modal bidirectional attention mechanism, and the stacked attention modules, the model gains a strong ability to filter out noise, strengthens the extraction of key information from the data, and improves prediction accuracy.

(2) By further analyzing the image input, a VQA model based on spatial relation aggregation and image global feature injection (IGFI) is proposed to improve the modeling of, and reasoning about, the relationships conveyed by the image. The model applies spatial relation aggregation to image region features to form an image global feature; to reduce computation cost, only the most highly correlated regions, identified via a correlation matrix, are aggregated. The image global feature is then injected into a network with inter-layer aggregation for attention learning. Next, a bilateral gating mechanism is introduced to balance the contributions of visual region features and the visual global feature according to the question. Finally, the fused features produce the prediction through a classification network. By using spatially aggregated global image features as supplementary evidence for answering, the model better models and reasons about relationships within the image, effectively improving prediction accuracy.

(3) The proposed models are evaluated on the VQA 2.0, VQA-CP 2.0, and GQA datasets. Ablation and visualization experiments demonstrate the effectiveness of each module, and the overall performance surpasses mainstream models.
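The channel-gating idea in MGSA can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function name `gated_self_attention`, the single projection `Wg`, and the mean-pooling of the other modality are all illustrative assumptions; the actual model would use learned Q/K/V projections and multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_self_attention(target, other, Wg):
    """Self-attention over `target` whose output channels are gated by a
    signal from the `other` modality (illustrative sketch, not the MGSA code)."""
    n, d = target.shape
    # plain scaled dot-product self-attention (Q/K/V projections omitted)
    scores = target @ target.T / np.sqrt(d)        # (n, n)
    attended = softmax(scores) @ target            # (n, d)
    # channel gate derived from the other modality: pool, project, squash to (0, 1)
    gate = sigmoid(other.mean(axis=0) @ Wg)        # (d,)
    # multiply each output channel by its gate, suppressing noisy channels
    return attended * gate
```

For image self-attention, `target` would be region features and `other` the question token features, so the question decides which visual channels pass through.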
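The spatial relation aggregation step in the IGFI model, in which only the most correlated regions are pooled into a global feature, can be sketched as follows. The function name, the dot-product correlation, and the top-k-then-average pooling are assumptions for illustration; the thesis's aggregation may use learned relation weights.

```python
import numpy as np

def aggregate_global_feature(regions, k):
    """Form a global image feature by aggregating, for each region, only its
    k most correlated peer regions (illustrative sketch of the IGFI idea)."""
    n, d = regions.shape
    corr = regions @ regions.T              # (n, n) region correlation matrix
    np.fill_diagonal(corr, -np.inf)         # ignore self-correlation
    agg = np.zeros_like(regions)
    for i in range(n):
        topk = np.argsort(corr[i])[-k:]     # indices of the k most correlated peers
        agg[i] = regions[topk].mean(axis=0) # aggregate only those regions
    return agg.mean(axis=0)                 # (d,) global image feature
```

Restricting each region to its top-k peers keeps the aggregation cost linear in k rather than quadratic in the number of regions, which matches the abstract's stated motivation of reducing computation.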
Keywords/Search Tags: Multimodal, Visual Question Answering, Self-Attention, Spatial Relation Aggregation, Bilateral Gating Mechanism