Visual Question Answering (VQA) is an emerging and important subtopic at the intersection of natural language processing and computer vision. As an important part of the Turing test, it lays a solid foundation for the development of future general artificial intelligence. To address the semantic gap between modalities, this paper combines attention mechanisms with graph-structure techniques to study visual question answering algorithms. The main contributions are as follows:

First, to address the problem that existing VQA algorithms do not fully learn the cross-modal interaction between the image and the question, a VQA algorithm based on a multi-level attention mechanism is proposed. The method consists of three modules: feature extraction, modal information interaction, and multimodal fusion with output classification. Image and text features are first extracted separately, and deep interaction and mutual guidance between the modalities are then carried out through multiple attention units, such as self-attention and guided attention, so that the most informative cross-modal features are used for answer reasoning. Experimental results show that the proposed method improves the accuracy on Number questions, which is typically low, by about 0.61%, while also giving satisfactory answers on the other question types.

Second, traditional VQA research does not fully capture the interactions between objects in the image, ignoring both the dynamic relationship between image and text semantics across the two modalities and the rich spatial structure among different regions. To solve these problems, a multi-module VQA model based on a graph attention network is proposed. The graph neural network relies on high-level text and image representations to continuously update information between nodes, so that the model can fully capture the dynamic interactions between objects in the visual scene and the textual context. Experimental results show that the proposed algorithm achieves an accuracy of 71.54% on Test-std, providing a powerful tool for visual question answering.

Third, to address the problem that graph attention network models do not fully consider the different contributions and influence of different nodes, the features of adjacent nodes are updated through an attention-weighting mechanism so that salient regions receive higher weight values. On this basis, a graph-convolution visual question answering method based on attention weighting is proposed. Compared with other VQA models on the VQA 2.0 dataset, the algorithm achieves an accuracy of 71.69% on Test-std, effectively improving the accuracy of visual question answering.
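The self-attention and guided-attention units described above can be sketched as scaled dot-product attention, where question features either attend to themselves or guide the re-weighting of image-region features. This is a minimal illustration with toy random inputs; the function and array names are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention: rows of `query` attend over rows of `key`."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)  # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)   # each query row sums to 1
    return weights @ value               # (n_q, d) attended features

# Toy sizes: 4 question tokens and 6 image regions, 8-dim features.
q_feats = np.random.rand(4, 8)
img_feats = np.random.rand(6, 8)

# Self-attention: question tokens attend to each other.
self_attended = attention(q_feats, q_feats, q_feats)

# Guided attention: image regions are re-weighted under guidance of the question.
guided = attention(q_feats, img_feats, img_feats)
print(self_attended.shape, guided.shape)  # (4, 8) (4, 8)
```

Stacking several such units lets each modality repeatedly refine the other before fusion and answer classification.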
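The attention-weighted node update used by the graph attention network can be illustrated with a minimal single-layer sketch: each node's features are refreshed from its neighbours, with learned attention weights assigning higher values to salient regions. The names (`gat_layer`, `W`, `a`) and toy random parameters below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def gat_layer(node_feats, adj, W, a):
    """One graph-attention update: nodes aggregate neighbours via softmax weights."""
    h = node_feats @ W                   # (n, d') projected node features
    n = h.shape[0]
    # Attention logit e_ij = LeakyReLU(a . [h_i || h_j]) for every node pair.
    logits = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = np.concatenate([h[i], h[j]]) @ a
            logits[i, j] = z if z > 0 else 0.2 * z  # LeakyReLU
    # Mask non-neighbours, then softmax over each node's neighbourhood.
    logits = np.where(adj > 0, logits, -1e9)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ h                   # salient neighbours get higher weight

# Toy graph: 3 detected objects, self-loops included in the adjacency matrix.
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
x = np.random.rand(3, 4)   # initial object features
W = np.random.rand(4, 4)   # projection weights (randomly initialised here)
a = np.random.rand(8)      # attention vector over concatenated feature pairs
out = gat_layer(x, adj, W, a)
print(out.shape)  # (3, 4)
```

In the full model, such updates would be repeated over several layers so that object nodes absorb both visual relations and the question context before answer prediction.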