
Deep Convolutional Network And Regional Attention Network For Visual Question Answering

Posted on: 2020-08-08  Degree: Master  Type: Thesis
Country: China  Candidate: P P Zeng  Full Text: PDF
GTID: 2428330596976513  Subject: Engineering
Abstract/Summary:
Vision and language are two core components of human intelligence for understanding the real world. They are also fundamental to achieving artificial intelligence and have been extensively researched in their respective areas. Recently, tremendous advances in deep learning have broken down the boundaries between vision and language, and interdisciplinary research problems have attracted extensive attention, such as visual question answering (VQA), image captioning, and image-text matching. Given an image (or a video) and a corresponding question in natural language, a VQA system reasons over the visual content of the image to infer the correct answer to the question. VQA can improve human-computer interaction as a natural way to query visual content, and it has many potential applications. The most immediate is as an aid to blind and visually impaired individuals, enabling them to get information about images both on the web and in the real world. Moreover, VQA is an important basic research problem: a good VQA system should be able to solve many computer vision problems, and it can be considered a component of a Turing Test for image understanding. A VQA system requires not only a strong understanding of the image but also sophisticated natural language processing techniques to encode the question. As an emerging research direction, VQA faces enormous challenges that remain to be studied and resolved.

VQA systems can be classified in many ways. By answer type, they can be divided into open-ended VQA and multiple-choice (MC) VQA; by the type of visual input, into ImageQA and VideoQA. Attention mechanisms are widely used in all of these settings, but they are not yet well explored. In open-ended ImageQA, visual attention is the most commonly used mechanism; however, existing visual attention models focus only on the regional characteristics of the CNN feature map and ignore its channel information. In previous works on MC ImageQA, the MC task reused trained open-ended ImageQA models and did not make full use of the option information. For VideoQA, the temporal information of the video is not fully utilized, and the text attention mechanism is not considered.

To solve the above problems, we propose a dedicated framework for each VQA setting in this paper:

1) Cubic Visual Attention (CVA), mainly for open-ended ImageQA. We take full advantage of the two characteristics (i.e., channel and spatial) of the CNN feature map and propose a novel Cubic Visual Attention framework that applies both channel attention and spatial attention to assist the VQA task. Channel-wise attention can be viewed as the process of selecting semantic attributes on the demand of the sentence context. Experimental results show that our proposed method significantly outperforms the state of the art on three public ImageQA datasets.

2) Multi-task learning and adaptive attention, specifically designed for the MC VQA task. We first fuse the answer-option and question features, and then adaptively attend to the visual features to infer the MC answer. Furthermore, we design the model as a multi-task learning architecture by integrating the open-ended VQA task to further boost the performance of MC VQA. It sets a new record on both MC datasets.

3) Structured Two-stream Attention Network for VideoQA. We introduce a new structure, namely the structured segment, that captures rich video information. Our structured two-stream attention component can simultaneously avoid the influence of the video background and attend to the spatial and long-range temporal information of the video as well as the text. Our proposed method significantly outperforms the state of the art on the TGIF-QA dataset.
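The combination of channel attention and spatial attention over a CNN feature map can be sketched as follows. This is a minimal numpy illustration of the general idea, not the thesis implementation; the scoring parameters `Wc` and `Ws` are hypothetical stand-ins for learned weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_attention(feat, q, Wc):
    """Reweight CNN channels conditioned on the question.
    feat: (C, H, W) CNN feature map; q: (d,) question embedding;
    Wc: (C, C + d) hypothetical learned scoring matrix."""
    C = feat.shape[0]
    pooled = feat.reshape(C, -1).mean(axis=1)       # global-average pool per channel
    alpha = softmax(Wc @ np.concatenate([pooled, q]))
    return feat * alpha[:, None, None]              # channel-weighted feature map

def spatial_attention(feat, q, Ws):
    """Pool image regions conditioned on the question.
    Ws: (C + d,) hypothetical learned scoring vector."""
    C, H, W = feat.shape
    regions = feat.reshape(C, H * W).T              # (H*W, C) region features
    ctx = np.concatenate([regions, np.tile(q, (H * W, 1))], axis=1)
    beta = softmax(ctx @ Ws)                        # attention over spatial positions
    return beta @ regions                           # (C,) attended visual vector
```

Applying the channel step before the spatial step yields a question-conditioned reweighting of both dimensions of the feature cube, which is the intuition behind attending to channels as semantic attributes and to regions as spatial locations.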
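The option-aware scoring in 2) — fusing each answer option with the question and then attending to the visual features — can be sketched as below. This is a hedged illustration under assumed shapes; `Wf`, `Wa`, and `ws` are hypothetical learned parameters, not names from the thesis.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_mc_options(q, options, regions, Wf, Wa, ws):
    """Pick the best answer option for MC VQA.
    q: (d,) question embedding; options: list of (d,) option embeddings;
    regions: (R, c) image region features;
    Wf: (k, 2d), Wa: (c, k), ws: (k + c,) hypothetical learned weights."""
    scores = []
    for o in options:
        fused = np.tanh(Wf @ np.concatenate([q, o]))    # joint question-option feature
        beta = softmax(regions @ (Wa @ fused))          # option-conditioned attention over regions
        v = beta @ regions                              # (c,) attended visual feature
        scores.append(ws @ np.concatenate([fused, v]))  # scalar score for this option
    return int(np.argmax(scores))                       # index of the highest-scoring option
```

Because the attention map depends on the fused question-option feature rather than the question alone, each candidate answer looks at different image regions, which is what lets the model exploit the option information that earlier MC pipelines ignored.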
Keywords/Search Tags: Deep Learning, Computer Vision, Natural Language Processing, Visual Question Answering, Attention Mechanisms