
Attention Mechanism And High-level Semantics For Visual Question Answering

Posted on: 2020-12-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D F Yu
Full Text: PDF
GTID: 1368330572978908
Subject: Information and Communication Engineering
Abstract/Summary:
With the rise of big data, high-performance computing, and deep learning technology, Artificial Intelligence (AI) is witnessing its third boom in history. Deep learning models have recently made breakthrough progress in many fields, such as computer vision, natural language processing, and speech recognition. On specific tasks such as object recognition, machine translation, and question answering, deep learning based methods have achieved human-level performance on some challenging datasets. However, humans with advanced intelligence usually perform multi-modal perception and reasoning to make decisions in more complex environments. Multi-modal vision-and-language tasks, such as image captioning, visual storytelling, and visual question answering, have therefore drawn attention from more and more researchers in recent years. Different from conventional image annotation tasks, image captioning and visual storytelling aim to describe the image content with a single sentence or a paragraph, which requires understanding the visual content and generating natural language that is semantically consistent with the image. Visual Question Answering (VQA) aims to empower a machine to automatically answer natural language questions about a visual image, which involves multi-modal inputs (i.e., the visual image and natural language questions) and a fine-grained understanding of the image content.

The key to VQA lies in the semantic understanding of both the visual image and the natural language question, and in joint reasoning between them. The attention mechanism is an effective way to perform multi-modal reasoning, and it plays three main roles in VQA. First, attention locates and extracts the information queried by the question. Second, attention grounds natural language in the visual image, which is important for fine-grained reasoning. Third, attention makes the VQA model more interpretable through visualization of the attention map. The utility of high-level semantics is two-fold: on one hand, high-level semantics extracted from the image bridge the semantic gap between the modalities and therefore help reasoning in a common semantic space; on the other hand, high-level semantics can serve as human-interpretable explanations and provide evidence for answer reasoning and for diagnosing a VQA system.
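As a concrete illustration of the first two roles, the following is a minimal sketch of question-guided soft attention over image region features. The module names, feature dimensions, and additive fusion scheme are illustrative assumptions for this sketch, not the exact formulation used in the dissertation.

```python
# Minimal sketch of question-guided soft attention over image region features.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, region_dim=2048, question_dim=1024, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question feature
        self.score = nn.Linear(hidden_dim, 1)              # scalar relevance per region

    def forward(self, regions, question):
        # regions:  (batch, num_regions, region_dim), e.g. CNN grid or detected objects
        # question: (batch, question_dim), e.g. last state of a question GRU
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (batch, num_regions)
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)     # weighted sum of regions
        return attended, weights

# Usage with random tensors standing in for real features.
att = QuestionGuidedAttention()
v = torch.randn(2, 36, 2048)   # 36 region features per image
q = torch.randn(2, 1024)       # question embeddings
fused, attn_map = att(v, q)
print(fused.shape, attn_map.shape)   # torch.Size([2, 2048]) torch.Size([2, 36])
```

The returned weights form the attention map that can be visualized for interpretability, while the attended feature is the question-relevant summary of the image.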
However, there are two issues in existing attention based VQA models. First, existing works usually perform attention on a single-level representation, which fails to provide the information required by complex and diverse questions. Second, most current attention models calculate the attention weights of image regions independently and completely ignore the contextual information among objects, which leads to failure in answering relation-based questions. To solve these issues, we conduct an extensive survey of attention mechanisms in visual question answering and propose to apply attention to multi-level representations of the image for more effective question-guided information extraction, understanding, and reasoning. We summarize our work and major novelty as follows:

Multi-level Attention Networks for Visual Question Answering. We propose a novel multi-level attention network for VQA. Existing attention-based VQA models usually extract only low-level visual information to infer the answer, ignoring the modeling of high-level semantics and spatial relationships in the image. Our multi-level attention network extracts multi-level information from the image and then distills, merges, and jointly reasons over these representations through the attention mechanism. In this way, the model reduces the domain gap with a semantic attention module and performs fine-grained spatial reasoning with a visual attention module. In addition, we model the visual relationships among local image regions with a bidirectional GRU layer to encode the contextual information of each region. At the time of publication, our model achieved state-of-the-art results on the two most challenging VQA datasets.

Multi-source Multi-level Attention Networks for Visual Question Answering. We propose multi-source multi-level attention networks to compensate for two drawbacks of the multi-level attention networks. First, the multi-level attention model only extracts information from the image itself at different levels, while some questions in VQA require knowledge-based reasoning. Second, in the multi-level attention networks, the bidirectional GRUs learn spatial relationships after the image regions are reshaped into a 1-dimensional sequence, which destroys the 2-dimensional structure of the image. The novelty of the proposed multi-source multi-level attention model is three-fold. First, the model introduces external knowledge sources in addition to the multi-modal information from vision and language, which enables knowledge-based reasoning in our VQA system. Second, the proposed 2D-GRUs model the visual relationships in two dimensions and four directions, which is consistent with the inherent structure of the visual image. Third, we achieve significantly better results than the multi-level attention model.

Graph Attention Networks for Visual Question Answering. We propose a graph attention network to alleviate two deficiencies of the multi-source multi-level attention networks. First, the multi-source multi-level attention model encodes image features from the last convolution layer, where the receptive fields correspond to a uniform grid of equally sized image regions, which is inconsistent with the multi-scale nature of objects. Second, the multi-source multi-level attention model pools the visual features weighted by attention scores, which discards the location information of the local regions. To solve these two issues, our graph attention network models the relationships among objects as a graph and then performs attention on both nodes and edges. Finally, we embed the attended graph into a vector to fuse the object information; a minimal sketch of this node-and-edge attention follows.
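The sketch below shows one way question-conditioned attention over object nodes and relation edges, followed by pooling into a single graph embedding, might be laid out. The feature sizes, the concatenation-based scoring, and the additive fusion of node and edge summaries are illustrative assumptions, not the dissertation's exact architecture.

```python
# Minimal sketch of attention over a graph of detected objects: nodes carry
# object features, edges carry pairwise relation features, and the question
# guides attention over both before the graph is pooled into a single vector.
# All module names and feature sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionVQA(nn.Module):
    def __init__(self, node_dim=2048, edge_dim=256, question_dim=1024, hidden_dim=512):
        super().__init__()
        self.node_att = nn.Linear(node_dim + question_dim, 1)   # node relevance score
        self.edge_att = nn.Linear(edge_dim + question_dim, 1)   # edge relevance score
        self.node_proj = nn.Linear(node_dim, hidden_dim)
        self.edge_proj = nn.Linear(edge_dim, hidden_dim)

    def forward(self, nodes, edges, question):
        # nodes:    (batch, n, node_dim)       object features
        # edges:    (batch, n, n, edge_dim)    pairwise relation features
        # question: (batch, question_dim)
        b, n, _ = nodes.shape
        q_node = question.unsqueeze(1).expand(-1, n, -1)
        q_edge = question.unsqueeze(1).unsqueeze(1).expand(-1, n, n, -1)

        # Question-conditioned attention weights for nodes and edges.
        node_w = F.softmax(self.node_att(torch.cat([nodes, q_node], -1)).squeeze(-1), dim=1)
        edge_w = F.softmax(
            self.edge_att(torch.cat([edges, q_edge], -1)).squeeze(-1).flatten(1, 2), dim=1
        ).view(b, n, n)

        # Pool the attended graph into one vector (node part + edge part).
        node_vec = (node_w.unsqueeze(-1) * self.node_proj(nodes)).sum(dim=1)
        edge_vec = (edge_w.unsqueeze(-1) * self.edge_proj(edges)).sum(dim=(1, 2))
        return node_vec + edge_vec   # graph embedding for the answer classifier
```

Detected object features and pairwise relation features (for example, derived from bounding-box geometry) would fill the node and edge tensors; the returned graph embedding is then fused with the question feature for answer prediction.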
Explainable Visual Question Answering using Attributes and Captions. We propose to break end-to-end VQA into two steps, explaining and reasoning, as an attempt toward more explainable VQA by shedding light on the intermediate results between the two steps. To that end, we first extract attributes and generate captions as explanations of the image. Next, a reasoning module uses these explanations, in place of the image, to infer the answer. The advantages of this breakdown are: (1) the attributes and captions reflect what the system extracts from the image and thus provide some insight into the predicted answer; (2) the intermediate results help identify whether the image understanding part or the answer inference part is at fault when the predicted answer is wrong. We conduct extensive experiments on a popular VQA dataset, and our system achieves performance comparable with the baselines, with the added benefits of interpretability and the inherent ability to improve further with higher-quality explanations.
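The following is a minimal sketch of this explain-then-reason pipeline: the image is first turned into textual explanations (attributes and a caption), and a text-only reasoning module then predicts the answer from the question plus those explanations. The explaining step is stubbed out, and all names, sizes, and the toy tokenizer are illustrative assumptions rather than the dissertation's actual components.

```python
# Minimal sketch of the two-step "explain then reason" pipeline.
# The explaining step is a stub; all names and sizes are illustrative.
import torch
import torch.nn as nn

class ReasoningModule(nn.Module):
    """Answers the question from text alone; the image is never seen here."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512, num_answers=3000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -- question concatenated with the explanations
        _, h = self.encoder(self.embed(token_ids))
        return self.classifier(h[-1])          # scores over candidate answers

def explain(image):
    # Stub for the explaining step: in the real system this would be an
    # attribute predictor and an image captioning model.
    attributes = ["red", "frisbee", "grass"]
    caption = "a dog catches a red frisbee on the grass"
    return attributes, caption

def tokenize(text, vocab_size=10000):
    # Toy hash-based tokenizer, for illustration only.
    return torch.tensor([[hash(w) % vocab_size for w in text.split()]])

image = None                                   # placeholder for an actual image
attributes, caption = explain(image)           # step 1: explain
question = "what color is the frisbee"
text = " ".join([question] + attributes + [caption])
logits = ReasoningModule()(tokenize(text))     # step 2: reason over text only
print(logits.shape)                            # torch.Size([1, 3000])
```

Because the reasoning module never sees the image, inspecting the attributes and the caption is enough to tell whether a wrong answer stems from faulty image understanding or from faulty inference.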
Keywords/Search Tags:Visual Question Answering, Attention Mechanism, Semantics, Graph, Relation Modeling, Knowledge Representation, Interpretability