With the rapid development of Internet technology, advances in natural language processing and computer vision have driven progress in artificial intelligence, giving rise to Visual Question Answering (VQA). VQA aims to combine vision and language so that computers can understand and answer human questions about images, making it a task of considerable research significance. To improve the answer accuracy of VQA models, this paper introduces attention mechanisms into the VQA model and builds multimodal fine-grained fusion models to predict the correct answer. The specific work and research results are as follows:

First, to address the noise present in the visual features extracted for visual question answering, a Residual Channel Self-attention Network (RCSNet) is proposed. The method uses an improved ResNet to enhance image features and thereby improve the accuracy of image attention; in addition, a new joint attention mechanism is proposed that combines word attention with image region attention to obtain more accurate object relation features. Experimental results show that the method significantly improves the attention extraction ability for image features.

Second, to address the low accuracy of visual question answering models on complex image questions, a Multimodal Chiastopic-Fusion Network (MCFNet) is proposed. The network uses RCSNet to enhance image features and constructs a cross-fusion network that integrates two dynamic information streams to obtain highly correlated multimodal features. Experimental results show that the network reaches 67.57% accuracy on the test-dev split of the VQA v1.0 dataset, 1.20% higher than the CAQT model, verifying the effectiveness and robustness of the MCFNet model.

Finally, to overcome the shortcoming that traditional VQA models cannot fully capture the complex correlations between multimodal features, a Multimodal Transmission Attention Network (MTANet) is proposed. The network recalibrates the input features by fusing features from intermediate layers; the fused features are then focused on fine-grained parts of the image and the question through overlapping computations in the transfer network. Experimental results show that the overall accuracy of MTANet reaches 69.91% and 68.59% on the test-dev splits of the VQA v1.0 and VQA v2.0 datasets, respectively, indicating that MTANet can effectively improve the accuracy of the VQA task.
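To make the residual channel attention idea behind RCSNet concrete: the abstract does not specify the exact architecture, so the module below is a minimal PyTorch sketch assuming a squeeze-and-excitation-style gating with a residual connection over ResNet feature maps; the class name, reduction ratio, and tensor shapes are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Sketch: SE-style channel attention with a residual connection,
    recalibrating CNN feature maps channel by channel (assumed design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze spatial dims to 1x1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                     # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x + x * w                      # residual recalibration

# Usage: refine image features before applying question-guided attention.
feats = torch.randn(2, 2048, 7, 7)            # e.g. ResNet final conv output
refined = ChannelSelfAttention(2048)(feats)
print(refined.shape)                          # torch.Size([2, 2048, 7, 7])
```

The residual addition keeps the original features intact while the learned channel gates suppress noisy channels, which matches the abstract's stated goal of denoising visual features before attention.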