Visual Question Answering (VQA) is an emerging and important subtopic at the intersection of natural language processing and computer vision. As an important part of the Turing test, it lays a solid foundation for the development of future general artificial intelligence. To address the semantic gap between modalities, this paper combines attention mechanisms with graph-structure techniques to study visual question answering algorithms. The main contributions are as follows:

First, to address the problem that existing VQA algorithms do not fully learn the cross-modal interaction between the image and the question, a VQA algorithm based on a multi-level attention mechanism is proposed. The method consists of three modules: feature extraction, modal information interaction, and multimodal fusion with output classification. Image and text features are first extracted separately, and deep interaction and mutual guidance between the modalities are then carried out through multiple attention units, such as self-attention and guided attention, so that the most informative cross-modal features are used for answer reasoning. Experimental results show that the proposed method improves the accuracy on Number questions, which is typically low, by about 0.61%, while also giving satisfactory answers on the other question types.

Second, traditional VQA research does not fully capture the interactions between objects in the image, ignoring both the dynamic relationship between image and text semantics across the two modalities and the rich spatial structure among different regions. To solve these problems, a multi-module VQA model based on a graph attention network is proposed. The graph neural network relies on high-level text and image representations to continuously update information between nodes, so that the model can fully capture the dynamic interactions between objects in the visual scene and the textual context. Experimental results show that the proposed algorithm achieves an accuracy of 71.54% on Test-std, providing a powerful tool for visual question answering.

Third, to address the problem that graph attention network models do not fully consider the different contributions and influence of different nodes, the features of adjacent nodes are updated through an attention-weighting mechanism so that salient regions receive higher weight values. On this basis, a graph-convolution visual question answering method based on attention weighting is proposed. Compared with other VQA models on the VQA 2.0 dataset, the algorithm achieves an accuracy of 71.69% on Test-std, effectively improving the accuracy of visual question answering.
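The self-attention and guided-attention units described above can be sketched as scaled dot-product attention, where question features either attend to themselves or guide the re-weighting of image-region features. This is a minimal illustration with toy random inputs; the function and array names are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention: rows of `query` attend over rows of `key`."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)  # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each query row sums to 1
    return weights @ value               # (n_q, d) attended features

# Toy sizes: 4 question tokens and 6 image regions, 8-dim features.
q_feats = np.random.rand(4, 8)
img_feats = np.random.rand(6, 8)

# Self-attention: question tokens attend to each other.
self_attended = attention(q_feats, q_feats, q_feats)

# Guided attention: image regions are re-weighted under guidance of the question.
guided = attention(q_feats, img_feats, img_feats)
print(self_attended.shape, guided.shape)  # (4, 8) (4, 8)
```

Stacking several such units lets each modality repeatedly refine the other before fusion and answer classification.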
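The attention-weighted node update used by the graph attention network can be illustrated with a minimal single-layer sketch: each node's features are refreshed from its neighbours, with learned attention weights assigning higher values to salient regions. The names (`gat_layer`, `W`, `a`) and toy random parameters below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def gat_layer(node_feats, adj, W, a):
    """One graph-attention update: nodes aggregate neighbours via softmax weights."""
    h = node_feats @ W                   # (n, d') projected node features
    n = h.shape[0]
    # Attention logit e_ij = LeakyReLU(a . [h_i || h_j]) for every node pair.
    logits = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = np.concatenate([h[i], h[j]]) @ a
            logits[i, j] = z if z > 0 else 0.2 * z  # LeakyReLU
    # Mask non-neighbours, then softmax over each node's neighbourhood.
    logits = np.where(adj > 0, logits, -1e9)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ h                   # salient neighbours get higher weight

# Toy graph: 3 detected objects, self-loops included in the adjacency matrix.
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
x = np.random.rand(3, 4)   # initial object features
W = np.random.rand(4, 4)   # projection weights (randomly initialised here)
a = np.random.rand(8)      # attention vector over concatenated feature pairs
out = gat_layer(x, adj, W, a)
print(out.shape)  # (3, 4)
```

In the full model, such updates would be repeated over several layers so that object nodes absorb both visual relations and the question context before answer prediction.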