
Research On Visual Question Answering Based On Attention Mechanism And Relation Extraction

Posted on: 2022-07-20  Degree: Master  Type: Thesis
Country: China  Candidate: Y T Ma  Full Text: PDF
GTID: 2518306725493134  Subject: Computer Science and Technology
Abstract/Summary:
Computer vision has received extensive attention in recent years; its main subtasks include image classification, object detection, visual question answering (VQA), and video understanding. Convolutional neural networks have driven the rapid development of computer vision and have likewise advanced image visual question answering. The goal of VQA is to answer natural-language questions about images. The same image can be paired with entirely different types of questions, and the same question can have different answers on different images. Because of this semantic variability of images and questions, a VQA model must not only understand the image and the question individually but also fuse information across the two modalities: it must locate the information in the image that corresponds to the question, and then process that information in the question's semantic context to produce the correct answer. Solving VQA therefore involves two key steps: extracting features from the image and the question, and fusing the two feature sets while reasoning about relations such as position and size among the relevant objects in the image.

During visual feature extraction, the semantics and attribute labels of the image and the objects within it serve as valuable prior knowledge. VQA methods therefore often use models pretrained on image classification and detection tasks as the image feature extraction network. At the same time, the diversity of question semantics means that some questions, or semantic objects within them, may correspond to only a few images or image regions; in other words, the model should be able to extract image features under few-shot conditions. After obtaining the semantic information of the image and the question, the VQA model must also understand the position and size relationships among the semantic objects in the image, within the question's semantic context, to capture complex semantic relations in the image. This requires the model to map semantic objects in the question onto image features, fuse information across modalities, and infer the characteristics of the relationships between objects.

To address these difficulties, this thesis proposes a group-aware image classification model that can extract category-level characteristics from a small number of samples, and, for the two central difficulties of image question answering (information interaction between modalities and relational reasoning among objects), an image visual question answering model that combines an attention mechanism with multi-scale relational reasoning. The specific work is as follows:

1. To better extract image features, this thesis explores image classification under few-shot conditions, where samples are sparse. It proposes a group-aware pruning method for few-shot learning: starting from ResNet, reinforcement learning generates a residual-block pruning strategy for each image in the support set, and a Strategy Consensus Module then merges the per-sample strategies into a single fixed pruning strategy. The pruned classification network is fine-tuned on the support set to offset the negative impact of pruning, and is then evaluated directly on the test set. By pruning ResNet's residual blocks, the method narrows the model's search space, preserving the ability to extract category features while avoiding over-fitting, and it also reduces the computational parameters and speeds up inference. The proposed method achieves the highest accuracy on the 5-way 5-shot tasks of the miniImageNet and Omniglot datasets; on miniImageNet in particular, it yields an accuracy improvement of 4.94% and a speed increase of 14.9%. These experiments verify the algorithm's strong performance on few-shot image classification tasks.

2. For the problems of multi-modal fusion and object relational reasoning in image visual question answering, this thesis proposes a model that combines an attention mechanism with relation extraction. The method uses a ResNet-based Faster R-CNN to extract regional features of the image and a Gated Recurrent Unit (GRU) to extract features of the question. A question-guided attention mechanism then enhances the visual region features relevant to the question, and local multi-scale relation extraction captures relationship information among the related regions. Finally, the relation information, visual information, and question information are fused to obtain the answer to the question about the image. The method extracts question-related regions and information well, and the multi-scale mechanism brings significant improvements on "Number" questions. It achieves overall performance improvements on the test-dev and test-std splits of the VQA v2 dataset, with gains of 1.14% and 0.95% respectively on "Number" questions.
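The question-guided attention step described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis's exact architecture: the region count, feature dimensions, random projection matrices, and the fusion-by-concatenation at the end are all illustrative assumptions standing in for learned components (Faster R-CNN region features, a GRU question encoding, and the full fusion/answer classifier).

```python
import numpy as np

# Assumed shapes for illustration: 36 region features (as a Faster R-CNN
# detector might produce) of dimension 2048, and one question feature
# (e.g. a final GRU hidden state) of dimension 512.
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 2048))
question = rng.standard_normal(512)

# Stand-ins for learned projections (randomly initialised here).
W_v = rng.standard_normal((2048, 512)) * 0.01  # regions -> shared space
W_q = rng.standard_normal((512, 512)) * 0.01   # question -> shared space

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Question-guided attention: score each region against the question,
# normalise the scores into a distribution, and pool the regions.
scores = (regions @ W_v) @ (W_q @ question)   # (36,) one score per region
weights = softmax(scores)                     # attention over regions
attended = weights @ regions                  # (2048,) question-relevant visual feature

# Toy fusion: project the attended visual feature and concatenate it with
# the question feature; the full model would feed this to an answer head.
fused = np.concatenate([attended @ W_v, question])  # (1024,)
print(weights.shape, fused.shape)
```

The attention weights form a probability distribution over regions, so regions irrelevant to the question contribute little to the pooled visual feature; this is the mechanism by which the question "selects" image regions before fusion.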
Keywords/Search Tags:CNN, Visual Question Answering, Attention Mechanism, Relation Extraction, Network Pruning