
Multi-modal Information Fusion In Visual Question Answering

Posted on: 2019-04-29    Degree: Master    Type: Thesis
Country: China    Candidate: Z Y Pang    Full Text: PDF
GTID: 2428330548477412    Subject: Computer technology
Abstract/Summary:
The rise of deep learning has sparked another wave of artificial intelligence and stimulated researchers to explore the cognitive abilities of machines. Image understanding, an important ability through which humans perceive the world, has drawn much attention. Recently, a number of tasks have been proposed to test image understanding and have promoted research in this area. Visual question answering (VQA), which expects a machine to answer a question about an image, is one of the most popular of these tasks. Compared to image captioning and blank-filling tasks, VQA allows simpler inputs and can be evaluated more easily. Research on VQA is of great significance: in theory, VQA is considered an AI-complete task and can serve as a replacement for the Visual Turing Test; in practice, a system that can answer questions about an image has a very wide range of applications.

Modeling the interaction between the image and the question, which reflects the process of correlating the semantics of the image and the question and then reasoning about the answer, lies at the heart of VQA. A great deal of previous research has tried to model this process more effectively by enhancing the expressive power of feature fusion operations. However, it is well known that a natural semantic gap exists between images and questions, which come from different modalities, and this gap hinders their direct interaction. Moreover, associating the semantics of images and text is a complex and general ability that is hard to learn from the annotated data of the VQA task alone. Therefore, we propose to simplify the interaction between image and question by supplementing the image with corresponding textual data. This approach has two advantages: first, the mapping from images to textual data can be supervised with additional training data, which yields a better connection between image and textual information; second, the interaction between the question and image information in textual form is easier to model.

In this paper, we first propose a single-modal question-answer model that transforms VQA into a textual question answering task by converting the image into a corresponding textual description, and then solves the textual question answering task with a GRU. The single-modal question-answer model outperforms baseline models and achieves performance comparable to models with attention mechanisms on the COCO-QA dataset. Subsequently, to compensate for the loss of image information in the single-modal question-answer model, we further propose a feature enhancement model that represents the image as textual and visual features at the same time, and we explore the relationship between textual features and the attention mechanism. Our feature enhancement model achieves performance competitive with state-of-the-art models on a balanced version of the most popular VQA dataset.
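As a concrete illustration of the single-modal formulation, the sketch below replaces the image with a textual description (for example, one produced by an off-the-shelf captioner) and answers the question with a GRU over the concatenated caption and question tokens, followed by an answer classifier. This is a minimal sketch in PyTorch; the class name TextOnlyVQA, the hyperparameters, and the single-layer GRU are illustrative assumptions, not the thesis's actual implementation.

    # Hypothetical sketch of the single-modal question-answer idea:
    # image -> textual description -> text-only QA with a GRU encoder
    # and a softmax answer classifier. Names and sizes are assumptions.
    import torch
    import torch.nn as nn

    class TextOnlyVQA(nn.Module):
        def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            # A single GRU reads the concatenated caption + question tokens.
            self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, num_answers)

        def forward(self, caption_ids, question_ids):
            # Treat the caption as a textual substitute for the image and
            # let the GRU model its interaction with the question.
            tokens = torch.cat([caption_ids, question_ids], dim=1)   # (B, T)
            embedded = self.embed(tokens)                            # (B, T, E)
            _, last_hidden = self.gru(embedded)                      # (1, B, H)
            return self.classifier(last_hidden.squeeze(0))           # (B, num_answers)

    # Toy usage with random token ids (batch of 2; caption length 12, question length 8).
    model = TextOnlyVQA(vocab_size=10000, num_answers=1000)
    caption_ids = torch.randint(1, 10000, (2, 12))
    question_ids = torch.randint(1, 10000, (2, 8))
    scores = model(caption_ids, question_ids)
    print(scores.shape)  # torch.Size([2, 1000])

The feature enhancement model described above would additionally keep the visual feature (e.g., a CNN feature map) alongside the textual one and combine them through an attention mechanism; that part is omitted here since the abstract does not specify its exact form.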
Keywords/Search Tags:Multimodal Learning, Feature Enhancement, Visual Question Answering, Deep Learning