In recent years, Visual Question Answering (VQA) based on the fusion of image visual features and question text features has attracted wide attention from researchers. Most existing models rely on attention mechanisms and intensive iterative operations for fine-grained interactive matching between the two modalities. Owing to the limited representational capacity of such network structures, the autocorrelation information within image regions and within question words is ignored, which biases the overall semantic understanding and thereby reduces the accuracy of answer prediction. We also observe that, after multiple rounds of bilateral joint attention, some valuable but undervalued edge information in the image is often completely forgotten. In addition, current VQA models perform poorly on reasoning questions. To address these problems, this thesis explores both the depth and the width of the network structure. In terms of depth, the regional information loss caused by the iterative attention mechanism is repaired, so that the edge information of the image and the question is preserved and performance is enhanced; in terms of width, the self-attention mechanism within a single modality is combined with the joint attention mechanism across the two modalities to strengthen the learning ability of the network architecture. The main research work is as follows.

To address the phenomenon that the autocorrelation information of image regions and question words is ignored, this thesis constructs a model architecture based on a symmetric attention mechanism, which can effectively exploit the semantic relationship between the image and the question, reducing the overall semantic-understanding deviation and improving the accuracy of answer prediction. The model is evaluated on the VQA 2.0 dataset, and the experimental results show that the symmetric-attention model has clear advantages over the baseline model.

To address the problem of edge information being ignored after multi-layer joint attention operations, a novel composite attention network with original-feature injection is proposed, which exploits both bilateral information and autocorrelation information within an overall deep framework. A visual-feature enhancement mechanism explores more complete visual semantics and avoids understanding deviations, and an original-feature injection module preserves the neglected edge information in the image. Extensive experiments on the VQA 2.0 dataset demonstrate the effectiveness of this method.

To address the inaccurate answers on reasoning questions, the influence of graph structure on the VQA network is explored. To better mine the correlations between regions, structured relational knowledge is introduced: complex content is understood and related relations are deduced through structured knowledge at multimodal and multi-cognitive scales, and various forms of data (such as scene graphs of images) are combined for reasoning, alleviating the insufficiency of cross-media (visual and textual) reasoning capability. Corresponding experiments on the VQA 2.0 and CLEVR datasets verify the validity of the graph structure.
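The two width-related mechanisms above, symmetric attention (self-attention within each modality followed by joint attention between modalities) and original-feature injection, can be sketched roughly as follows. This is a minimal single-head NumPy illustration, not the thesis's actual implementation: the absence of learned projection matrices and the plain residual add standing in for the original-feature injection module are simplifying assumptions.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Basic attention: each query row attends over the key/value rows."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ v                             # (n_q, d)

def symmetric_attention_unit(img_feats, txt_feats):
    """One symmetric unit: self-attention inside each modality captures
    autocorrelation, then joint attention runs in both directions, and a
    residual add re-injects the original features so that edge information
    is not forgotten after stacking several such units."""
    # intra-modal self-attention (autocorrelation of regions / words)
    img_self = scaled_dot_product_attention(img_feats, img_feats, img_feats)
    txt_self = scaled_dot_product_attention(txt_feats, txt_feats, txt_feats)
    # bilateral joint attention: question-guided image attention and
    # image-guided question attention
    img_cross = scaled_dot_product_attention(img_self, txt_self, txt_self)
    txt_cross = scaled_dot_product_attention(txt_self, img_self, img_self)
    # original-feature injection, sketched here as a residual connection
    return img_cross + img_feats, txt_cross + txt_feats

# toy example: 36 region features and 14 word features of dimension 8
rng = np.random.default_rng(0)
img_out, txt_out = symmetric_attention_unit(
    rng.standard_normal((36, 8)), rng.standard_normal((14, 8)))
```

Stacking this unit gives the deep composite framework; because the raw features are re-added at every layer, edge regions that receive low attention weights early on still reach the answer predictor.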