| With the continued development of computer vision and natural language processing, visual question answering (VQA), which combines knowledge from both fields, has become an important research direction in computer science. Given an image and a question, the goal of VQA is for the computer to combine the information contained in the image and the text and generate a natural-language answer. The task therefore requires multimodal understanding and reasoning over both image and text. Most VQA methods are end-to-end learning systems that treat the problem as a classification task: a pre-trained CNN processes the image, an RNN processes the text, and the two sets of features are then combined through a variety of techniques to predict the answer. Graph networks have shown strong classification and reasoning ability, but images and text are Euclidean-domain data and cannot be fed directly to a graph convolutional network (GCN); image and text features must first be expressed as graph-structured data. In addition, a graph network may suffer from over-smoothing during training: nodes become less distinguishable as learning proceeds, which degrades the learned representation. To address these problems, this paper takes graph convolutional networks as its research object and proposes using the features of multiple object instances in the image as the nodes of the graph and the Euclidean distances between nodes as the adjacency matrix of the graph data. The graph network itself is also improved: self-connections are added to the forward propagation of the GCN to keep the nodes in the graph distinguishable, and regularization terms are added to reduce over-fitting. For feature extraction, Faster R-CNN extracts object-level region features from the image, GloVe encodes the question into a sequence of word vectors, and the sequence is fed to a GRU to extract the question features. The visual and text features are fused into graph-structured data, and the final answer is classified after learning through the graph convolutional layers. This paper thus addresses the visual question answering task with a fusion graph convolutional network. Experiments on the VQA 2.0 dataset give an average answer prediction accuracy of 66.63%, and optimizing the network with the improved GCN inter-layer propagation method increases accuracy by 0.21%. The method achieves higher prediction accuracy than the classical methods it is compared with, which verifies its effectiveness on the visual question answering task. |
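
The graph construction and propagation step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy two-dimensional region features, and the identity weight matrix are all hypothetical, and the symmetric degree normalization is an assumed, commonly used GCN choice that the abstract does not specify. The adjacency matrix uses the raw pairwise Euclidean distances, as the abstract states, and the self-connection is added as an identity term before normalization.

```python
import math


def euclidean_adjacency(nodes):
    """Adjacency matrix of pairwise Euclidean distances between node features."""
    n = len(nodes)
    return [[math.dist(nodes[i], nodes[j]) for j in range(n)] for i in range(n)]


def add_self_connections(adj):
    """Add the identity matrix so each node also keeps its own features."""
    n = len(adj)
    return [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def normalize(adj):
    """Symmetric normalization D^-1/2 A D^-1/2 (an assumption, not from the paper)."""
    # Every row sum is at least 1 once self-connections have been added.
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in adj]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(nodes, weight):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    adj = normalize(add_self_connections(euclidean_adjacency(nodes)))
    hidden = matmul(matmul(adj, nodes), weight)
    return [[max(0.0, value) for value in row] for row in hidden]


# Hypothetical toy input: three region feature vectors of dimension 2.
regions = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only
updated = gcn_layer(regions, weight)
```

In a full pipeline the node features would be region features fused with the question encoding, and the weight matrix would be learned; both are replaced by fixed toy values here.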