
Visual Semantic Representation For Visual Dialog

Posted on: 2020-04-30    Degree: Master    Type: Thesis
Country: China    Candidate: Q Wang    Full Text: PDF
GTID: 2518306518463054    Subject: Computer Science and Technology
Abstract/Summary:
With the advances of computer vision and natural language processing, various vision-language tasks have attracted much attention in multimedia content analysis. However, previous tasks such as Image Captioning and Visual Question Answering (VQA) only help humans gain a basic understanding of an image's visual content through a single interaction, whereas in real life image understanding is a gradual process that accumulates over multiple interactions. To better simulate real human interaction, researchers introduced the Visual Dialog task, which typically consists of multiple rounds of questions and answers. The goal of Visual Dialog is to answer a sequence of questions, posed in the form of a dialog, about a given input image. It therefore requires a deep understanding of the visual information in the image as well as the semantic information in the dialog history and the target question, and it must also analyze and exploit the relationships between information from different modalities. In this thesis, we study how to solve the Visual Dialog task with an effective visual representation of the image and an effective semantic representation of the dialog history and the target question.

First, we propose a visual selection approach for visual dialog. Since each question in the dialog focuses on only part of the image, selecting the relevant objects (regions) of the image can improve the accuracy of question answering. The method consists of three modules. First, the visual feature extraction module extracts meaningful object (region) features from the image. Second, the visual selection module produces semantic guidance based on the dialog history and the question, and then selects the relevant object (region) features according to that guidance; within this module, we design three kinds of semantic guidance and three types of visual feature selection. Third, the multi-modal fusion module fuses the final visual feature, the question feature, and the dialog history feature. The answer is predicted by scoring the similarity between the fused feature and every candidate answer feature, and the candidate answer with the highest similarity is taken as the predicted answer, as sketched below.
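A minimal PyTorch sketch of this pipeline, assuming pre-extracted region features and already-encoded question/history vectors. The module names, dimensions, and the dot-product similarity are illustrative assumptions, not the thesis's actual implementation, and only one soft-attention variant of the three guidance/selection designs is shown.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSelection(nn.Module):
    def __init__(self, v_dim=2048, q_dim=512, hid=512):
        super().__init__()
        self.guide = nn.Linear(2 * q_dim, hid)   # semantic guidance from question + history
        self.v_proj = nn.Linear(v_dim, hid)      # project object/region features
        self.fuse = nn.Linear(hid + 2 * q_dim, hid)

    def forward(self, regions, q, h):
        # regions: (B, N, v_dim) object/region features; q, h: (B, q_dim)
        g = torch.tanh(self.guide(torch.cat([q, h], dim=-1)))      # (B, hid) guidance
        v = torch.tanh(self.v_proj(regions))                       # (B, N, hid)
        scores = torch.bmm(v, g.unsqueeze(-1)).squeeze(-1)         # (B, N) relevance
        alpha = F.softmax(scores, dim=-1)                          # soft selection weights
        v_sel = torch.bmm(alpha.unsqueeze(1), v).squeeze(1)        # (B, hid) selected visual feature
        return torch.tanh(self.fuse(torch.cat([v_sel, q, h], dim=-1)))

def rank_answers(fused, cand):
    # cand: (B, K, hid) encoded candidate answers; rank by dot-product similarity
    sims = torch.bmm(cand, fused.unsqueeze(-1)).squeeze(-1)        # (B, K)
    return sims.argmax(dim=-1)                                     # index of predicted answer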
We also propose a multi-level attention method for visual dialog, which attends to both high-level and low-level information in the dialog history, the question, and the image. This approach consists of four modules. First, the feature extraction module extracts the image feature, the question feature, and the dialog history feature. Second, the low-level attention module enhances the word representations of the dialog history and the question based on word-to-word connections, and enriches the region representations of the image based on region-to-region relations. Third, the high-level attention module selects important words in the dialog history and the question to supplement the detailed semantic information, and selects relevant image regions to provide targeted visual information for question answering. Finally, the multi-modal fusion module fuses the final visual feature, the question feature, and the dialog history feature; as in the first approach, the answer is predicted by its similarity to the candidate answer features, and the candidate with the highest similarity is taken as the predicted answer. A sketch of the two attention levels follows.
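A minimal sketch of the two attention levels, again with hypothetical names and dimensions: the low-level module is written as a single self-attention layer relating words to words (the same layer would relate regions to regions), and the high-level module pools a sequence by selecting its important elements with a learned scoring vector.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LowLevelAttention(nn.Module):
    # Enhance each element by attending to its peers
    # (word-to-word or region-to-region self-attention).
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (B, N, dim)
        att = torch.bmm(self.q(x), self.k(x).transpose(1, 2))
        att = F.softmax(att / x.size(-1) ** 0.5, dim=-1)  # (B, N, N)
        return x + torch.bmm(att, self.v(x))              # residual enhancement

class HighLevelAttention(nn.Module):
    # Select important words/regions and pool them into one vector.
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                                 # x: (B, N, dim)
        w = F.softmax(self.score(x).squeeze(-1), dim=-1)  # (B, N) importance
        return torch.bmm(w.unsqueeze(1), x).squeeze(1)    # (B, dim) pooled feature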
Keywords/Search Tags: Visual Dialog, Visual Representation, Semantic Representation, Attention Mechanism