Visual Question Answering (VQA) has become a popular research direction. Traditional VQA models predict answers by simply fusing visual features with text features and do not address the semantic gap between input features of different modalities. Many recent studies apply attention mechanisms to model information interaction within and between modalities, but most models still suffer from insufficient interaction between modal information and from the loss of high-level semantics of multi-modal information. To address these problems, this thesis first builds a bidirectional interactive attention mechanism that establishes cross-modal semantic associations by simultaneously performing vision-guided text attention and text-guided vision attention, and on this basis constructs a cross-modal interaction module. In parallel, an image self-attention module and a text self-attention module are built with the self-attention mechanism and applied to the two input modalities respectively, so as to fully capture the correlations among the features of each image region and the internal correlations among the words of the question text. Stacking these modules, combined with a feature extraction network and a feature fusion method, yields a complete VQA model. To further improve prediction performance, the model uses the Faster Region-based Convolutional Neural Network (Faster R-CNN) to obtain visual features, because its bottom-up attention can filter specific objects out of the image, which facilitates attention modeling. The resulting model achieves high prediction accuracy in experiments, and ablation analysis shows that the three attention modules all contribute positively to prediction, especially the cross-modal interaction module, demonstrating the superiority of the bidirectional interactive attention mechanism.

The attention mechanisms above focus on modeling global dependencies and neglect the local dependencies of image features. To solve this problem, a dynamic routing attention mechanism is proposed: adjacency masks added to the self-attention computation limit the range of the receptive field, and routing probabilities then dynamically select among different receptive fields. This mechanism is applied in the image self-attention module to model local relationships among the features, and a pretrained Convolutional Neural Network (CNN) is used to process the input visual information so that the model can apply attention directly to grid features, which speeds up computation. Experiments show that grid features suit the optimized model better than the original candidate-region features and further improve its prediction accuracy.
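As a concrete illustration of the bidirectional interactive attention described above, the following PyTorch sketch runs text-guided vision attention and vision-guided text attention in parallel. The module and parameter names (GuidedAttention, BiInteractiveAttention, d_model) are illustrative assumptions rather than the thesis's actual implementation; the intra-modal self-attention modules correspond to the same computation with query and context drawn from a single modality.

```python
# Minimal sketch of bidirectional interactive (co-)attention; names are
# illustrative assumptions, not the thesis's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Attention where `query` comes from one modality and `context`
    (keys/values) comes from the other, e.g. text-guided vision."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, query, context):
        # query:   (B, Nq, d) features of the guided modality
        # context: (B, Nk, d) features of the guiding modality
        attn = torch.matmul(self.q(query),
                            self.k(context).transpose(-2, -1)) * self.scale
        attn = F.softmax(attn, dim=-1)
        return torch.matmul(attn, self.v(context))

class BiInteractiveAttention(nn.Module):
    """Performs vision-guided text and text-guided vision attention
    simultaneously, as in the cross-modal interaction module."""
    def __init__(self, d_model):
        super().__init__()
        self.text_guides_vision = GuidedAttention(d_model)
        self.vision_guides_text = GuidedAttention(d_model)

    def forward(self, img_feats, txt_feats):
        img_out = self.text_guides_vision(img_feats, txt_feats)
        txt_out = self.vision_guides_text(txt_feats, img_feats)
        return img_out, txt_out
```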
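The dynamic routing attention can be sketched as follows, under the assumption that the visual features lie on an H x W grid: each adjacency mask restricts self-attention to a k x k neighborhood, and a learned routing distribution mixes the masked attention outputs per position. The mask construction, window sizes, and routing head here are illustrative assumptions, not the thesis's exact design.

```python
# Minimal sketch of dynamic routing attention over grid features;
# window sizes and the routing head are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def neighborhood_mask(h, w, k):
    """Boolean (h*w, h*w) mask, True where two grid cells lie within a
    k x k window of each other (Chebyshev distance <= k // 2)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)   # (N, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)
    return dist <= k // 2

class DynamicRoutingAttention(nn.Module):
    def __init__(self, d_model, grid_hw=(7, 7), window_sizes=(3, 5, 7)):
        super().__init__()
        h, w = grid_hw
        masks = torch.stack([neighborhood_mask(h, w, k) for k in window_sizes])
        self.register_buffer("masks", masks)          # (R, N, N) adjacency masks
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.router = nn.Linear(d_model, len(window_sizes))
        self.scale = d_model ** -0.5

    def forward(self, x):                             # x: (B, N, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (B, N, N)
        # One masked attention output per receptive field.
        outs = []
        for m in self.masks:
            attn = F.softmax(scores.masked_fill(~m, float("-inf")), dim=-1)
            outs.append(torch.matmul(attn, v))        # (B, N, d)
        outs = torch.stack(outs, dim=-2)              # (B, N, R, d)
        # Routing probabilities select receptive fields per position.
        route = F.softmax(self.router(x), dim=-1)     # (B, N, R)
        return (route.unsqueeze(-1) * outs).sum(dim=-2)
```

Mixing the masked outputs with a softmax over routing logits keeps the selection differentiable, so the receptive field choice can be learned end to end rather than fixed in advance.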
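Grid features can be obtained from a pretrained CNN roughly as below; a torchvision ResNet-50 is used here as an illustrative stand-in for whichever backbone the thesis actually employs. The final convolutional feature map is flattened into a sequence of grid cells so that the attention modules can consume it directly, in place of Faster R-CNN region features.

```python
# Minimal sketch of grid-feature extraction with a pretrained CNN;
# ResNet-50 is an assumed stand-in for the thesis's backbone.
import torch
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
# Drop the average-pool and classification head, keep the conv trunk.
trunk = torch.nn.Sequential(*list(backbone.children())[:-2])
trunk.eval()

with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)        # dummy input image
    fmap = trunk(img)                        # (1, 2048, 7, 7) feature map
    grid = fmap.flatten(2).transpose(1, 2)   # (1, 49, 2048) grid features
```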