Fill-in-the-blank Image Question Answering Based On Gated Recurrent Units

Posted on: 2020-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: J Hu
Full Text: PDF
GTID: 2518306512956679
Subject: Intelligent computing and systems
Abstract/Summary:
With the upgrade of smart devices and the rapid development of network technologies, the images that people encounter are becoming more and more diverse. Making machines better understand the semantic content of images, and thereby make people's lives easier, has become a hot topic in image research, especially given recent breakthroughs in deep learning and in cross-modal work spanning computer vision and natural language processing. Image question answering has become an important research direction in artificial intelligence. Researchers have proposed many methods for different variants of the task. The basic idea is usually either to predict the answer from a mapping of combined image and text features, or to apply an attention mechanism to localize objects at the pixel level. Some researchers have also incorporated external knowledge to improve answer accuracy. This thesis fuses the global and local visual features of the image to make full use of the visual semantic information available during question answering. We then propose a novel Semantic Bi-Embedded model for fill-in-the-blank image question answering, which learns the relevance of cross-modal semantic information to predict answers. We run experiments on the public image question answering dataset Visual Madlibs and compare the model with recent methods and designed baselines. The thesis mainly covers the following research contents:

(1) The image question answering task in this thesis uses fill-in-the-blank questions with multiple candidate answers. Most image question answering work focuses on visual information and ignores the semantic information provided by the question itself, even though the specifics of a question are easier to track and interpret than image features. When word-vector features are fed into the model, the textual semantic information before and after the blank provides an effective direction for logical reasoning and content cues for predicting the answer. Especially in complex task scenarios involving time series or image emotion analysis, this question style can take full advantage of the Gated Recurrent Unit (GRU) to capture the semantic information of the text.

(2) We propose the Semantic Bi-Embedded Gated Recurrent Unit (SBE-GRU) to fuse image features and text features. This structure relates the semantic information at the beginning and the end of the question sentence more closely, which matters for active, passive, or time-based questions. We also use the extended GRU network to maintain the semantic consistency of vision and language in a high-dimensional space. In addition, we feed the list of candidate answers directly into training, which improves training efficiency and helps the model predict answers accurately.

(3) We propose fusing the global and local features of the image to describe deep semantic information. Global features typically represent the general scene of the image, while local features represent precise information about the objects in it; the two are merged and used together in model training. This allows the computer to "see" the image and "understand" its content during training. An attention mechanism is also introduced, which makes the model focus on the regions of the image that are relevant to the question and reduces the image noise introduced by the global visual information.
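
The following Python sketch illustrates, in a very reduced form, the pipeline outlined in points (1) to (3): question-guided attention over local region features, fusion with a global feature, and scoring of the candidate answers. It is not the SBE-GRU model. The question is assumed to be already encoded as a single vector (the job the abstract assigns to the GRU), and the function names, the dot-product attention, the averaging fusion, and the cosine scoring are all hypothetical choices made for illustration. All vectors are assumed to share one dimension.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec, local_feats):
    """Weight each local region by its dot-product similarity to the question."""
    weights = softmax([dot(question_vec, region) for region in local_feats])
    dim = len(local_feats[0])
    attended = [
        sum(w * region[i] for w, region in zip(weights, local_feats))
        for i in range(dim)
    ]
    return attended, weights


def fuse(global_feat, attended_local):
    """Assumed fusion rule: element-wise average of global and attended local features."""
    return [(g + l) / 2.0 for g, l in zip(global_feat, attended_local)]


def cosine(u, v):
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    return dot(u, v) / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_answer(question_vec, global_feat, local_feats, candidates):
    """Return the index of the best-scoring candidate answer and all scores."""
    attended, _ = attend(question_vec, local_feats)
    visual = fuse(global_feat, attended)
    # Assumed joint embedding: element-wise sum of question and visual vectors.
    joint = [q + v for q, v in zip(question_vec, visual)]
    scores = [cosine(joint, cand) for cand in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


# Toy, made-up inputs: a 3-dimensional embedding space.
question_vec = [1.0, 0.0, 0.0]                       # stand-in for a GRU-encoded question
global_feat = [1.0, 1.0, 0.0]                        # whole-image (scene) feature
local_feats = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]     # two region features
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

best, scores = pick_answer(question_vec, global_feat, local_feats, candidates)
print("chosen candidate index:", best)
```

In this toy setup the attention weights concentrate on the region most similar to the question, so the global feature contributes scene context without dominating the fused vector. A trained model would learn the attention, fusion, and scoring functions rather than fix them by hand.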
Keywords/Search Tags:fill-in-the-blank, image question answering, convolutional neural network, gated recurrent unit, global and local image features