
Research On Visual Question Answering Based On Visual Attention

Posted on: 2019-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: H B Liu
Full Text: PDF
GTID: 2428330545951222
Subject: Software engineering

Abstract/Summary:
Visual Question Answering (VQA) is one of the popular research directions of recent years. It spans the two major fields of computer vision and natural language processing and has attracted great attention from researchers. In this thesis, we extract the salient features of visual and textual information by simulating the human attention mechanism and construct a multimodal fusion model for inferring VQA answers. We study VQA from three aspects: the visual attention mechanism, the visual-textual co-attention mechanism, and an enhanced co-attention mechanism that incorporates visual semantic concepts. The main research work is as follows:

(1) Earlier VQA methods utilized a global image feature, which loses the spatial information of the image; as a result, those models cannot effectively understand fine-grained image features. To address this problem, we propose Spatial Information Enhanced Attention Networks for Visual Question Answering. The method extracts middle-layer image features that retain spatial information through a deep Convolutional Neural Network (CNN), feeds these features into a Bi-directional Long Short-Term Memory (Bi-LSTM) network to enhance the context-aware information of image regions, and then introduces a region-based, single-modality Attention-Based Attention (LBA) model to extract salient image features, yielding an initial weighted feature vector for the image. A Bi-LSTM is also used to extract the semantic features of the question, which are integrated with the initial weighted image features to obtain the guidance information for the visual attention network. Because the generalization ability of a single-layer visual attention network is often insufficient, we stack multiple attention layers to enhance the model's ability to reason and predict on complex inputs. Experiments show that, compared with global image features, the middle-layer spatial features have stronger expressive power, and that the multi-layer attention network effectively exploits the spatial information of the image features, significantly improving VQA performance.

(2) Most VQA methods use only a single-modal visual attention mechanism and ignore the importance of a textual attention mechanism for extracting question semantics. To address this, we propose a multi-modal cross-guided co-attention network. The method uses a robust object detection model combined with a CNN to extract image features based on Region Proposals (RP), and uses Bi-directional Gated Recurrent Units (Bi-GRU) to extract high-level semantic features of the question through forward and backward GRUs. The LBA model is then used to extract the salient features of the image and the question, yielding an initial weighted feature vector for each image region and for each word in the question. To enhance the expressive ability of the attention model, the method introduces a new nonlinear activation into the multimodal attention model and adopts a cross-guided fusion strategy to construct the multi-modal cross-guided co-attention network, which predicts the answer by reasoning over the fused features. Experiments show that the cross-guided co-attention network fully extracts and utilizes the salient features of vision and text, and that the nonlinear activation effectively improves the expressive ability of the model, thereby improving its performance on the VQA dataset.
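For illustration, the following is a minimal PyTorch sketch of the cross-guided co-attention idea in (2): the question representation guides attention over image regions while the image representation guides attention over question words, and the two attended vectors are fused. The class and function names, the mean-pooled summaries, the multiplicative fusion, and all dimensions are assumptions made for this sketch, not the thesis's exact architecture; the same guided-attention building block, applied over several hops, also corresponds roughly to the stacked visual attention in (1).

```python
# Hedged sketch of cross-guided co-attention (names/dimensions are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

def guided_attention(features, guide, proj_f, proj_g, score):
    # features: (batch, n, d) items to attend over; guide: (batch, d) guiding vector
    h = torch.tanh(proj_f(features) + proj_g(guide).unsqueeze(1))
    weights = F.softmax(score(h).squeeze(-1), dim=1)
    return (weights.unsqueeze(-1) * features).sum(dim=1)

class CrossGuidedCoAttention(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.img_f = nn.Linear(dim, hidden); self.img_g = nn.Linear(dim, hidden)
        self.img_s = nn.Linear(hidden, 1)
        self.txt_f = nn.Linear(dim, hidden); self.txt_g = nn.Linear(dim, hidden)
        self.txt_s = nn.Linear(hidden, 1)

    def forward(self, region_feats, word_feats):
        # region_feats: (batch, regions, dim), e.g. from an object detector + CNN
        # word_feats:   (batch, words, dim),   e.g. from a Bi-GRU
        q_summary = word_feats.mean(dim=1)    # coarse question summary
        v_summary = region_feats.mean(dim=1)  # coarse image summary
        # Each modality's attention is guided by the other modality's summary.
        attended_img = guided_attention(region_feats, q_summary,
                                        self.img_f, self.img_g, self.img_s)
        attended_txt = guided_attention(word_feats, v_summary,
                                        self.txt_f, self.txt_g, self.txt_s)
        return attended_img * attended_txt    # simple multiplicative fusion

if __name__ == "__main__":
    m = CrossGuidedCoAttention()
    v = torch.randn(2, 36, 512)   # e.g. 36 region proposals
    q = torch.randn(2, 14, 512)   # e.g. 14 question words
    print(m(v, q).shape)          # torch.Size([2, 512])
```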
(3) To address the semantic gap between low-level image features and high-level question semantics, we propose a VQA method that enhances the co-attention network with visual semantic concepts. The method adopts an object detection approach to extract visual semantic concepts from the image and introduces a semantic attention mechanism to select the concepts relevant to the question. To fully extract the high-level semantics of the question, the method uses a hierarchical structure that encodes the question at low, middle, and high levels, and applies a sequential co-attention model at each level to extract the salient features of the image, the question, and the visual semantic concepts. Finally, a multi-layer feed-forward network fuses the weighted feature vectors obtained across the hierarchy into a discriminative feature vector for answer prediction. Experiments show that this method effectively narrows the semantic gap between image features and the high-level semantic features of the question, and that the hierarchical structure has a strong ability to extract question semantics, improving VQA performance.
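The semantic attention step in (3) can likewise be sketched as question-guided attention over embedded visual concepts followed by a simple fusion. The concept embeddings, layer names, dimensions, and fusion choice below are again assumptions for illustration, not the method's actual implementation.

```python
# Hedged sketch of question-guided semantic attention over detected concepts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticConceptAttention(nn.Module):
    def __init__(self, concept_dim=300, q_dim=512, hidden_dim=256):
        super().__init__()
        self.c_proj = nn.Linear(concept_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.fuse = nn.Linear(concept_dim + q_dim, q_dim)

    def forward(self, concept_embs, q_vec):
        # concept_embs: (batch, num_concepts, concept_dim) embeddings of visual
        #               concepts produced by an object detector (assumed input)
        # q_vec:        (batch, q_dim) question feature
        h = torch.tanh(self.c_proj(concept_embs) + self.q_proj(q_vec).unsqueeze(1))
        weights = F.softmax(self.score(h).squeeze(-1), dim=1)
        attended_concepts = (weights.unsqueeze(-1) * concept_embs).sum(dim=1)
        # Fuse the question-relevant concepts with the question feature.
        return torch.relu(self.fuse(torch.cat([attended_concepts, q_vec], dim=-1)))

if __name__ == "__main__":
    m = SemanticConceptAttention()
    concepts = torch.randn(2, 10, 300)  # e.g. 10 detected concept embeddings
    q = torch.randn(2, 512)
    print(m(concepts, q).shape)         # torch.Size([2, 512])
```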
Keywords/Search Tags:visual question answering, visual attention, co-attention mechanism, semantic attention, convolutional neural networks