
Research And Application Of Visual Dialogue Based On Dialogue State Tracking

Posted on: 2022-07-13
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Pang
Full Text: PDF
GTID: 1488306326480494
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, Visual Dialogue (VD), lying at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), has attracted much attention. A text-based dialogue system relies on natural language alone, while most everyday conversations occur in a specific real-world scene. Visual dialogue extends text-based dialogue by integrating visual scenes with natural language, and it has a wide range of industrial applications: a conversational agent with visual dialogue capability can be embedded into household robots, mobile phones, or cars, becoming an assistant and partner in people's lives and changing lifestyles formed over thousands of years.

Previous work focused on encoding the linguistic information in a dialogue, injecting the image as fixed visual features into multiple rounds of language encoding and thereby ignoring the interaction between linguistic and visual information. This dissertation carries out a series of studies on this issue. To summarize, our contributions are as follows:

First, we introduce a visual dialogue state for the visual dialogue task. We argue that a dialogue state should fuse visual and linguistic information as the conversation proceeds: newly introduced knowledge updates the dialogue state, and the updated state serves as the basis for the current decision. Based on this definition, we propose a visual dialogue state tracking model for the GuessWhat?! game, which realizes dynamic interaction between language and vision during the conversation. In particular, its guessing success rate of 83.3% on GuessWhat?! approaches the human-level accuracy of 84.4%.

Second, we extend limited vision to unlimited vision and apply it to the QBot in GuessWhich. Specifically, inspired by the human dual-coding theory, we model question generation based on mental images for the first time under unlimited visual information. A mental image is the imagined visual scene that a human constructs from a text-based dialogue; unlike a real photo, it is a virtual image. This conforms to the dual-coding theory of human cognition, which postulates that language and visual scenes are associated with each other. Compared with the state-of-the-art model that introduces real images and reaches 96.09% PMR, our model achieves 95.91% PMR, outperforming all other models that use no visual information.

Finally, we introduce the idea of the visual dialogue state into dialogue response generation with limited visual information, such as the ABot in GuessWhich and Visual Dialog. We treat the dialogue history as an ongoing dialogue, define a dialogue state, and reconstruct it from the history; the answer to a new question is then generated from the obtained dialogue state. Moreover, we are the first to combine co-reference resolution and visual grounding in the same multimodal process: co-reference resolution finds the entity in the dialogue history that a pronoun in the question refers to, while visual grounding maps the question to the relevant visible object in the image. Built on a simple LSTM encoder-decoder, our model reaches an NDCG of 61.01 and a Mean rank of 17.21, outperforming other models with the same encoder-decoder.

In conclusion, we introduce the visual dialogue state to the visual dialogue task and present visual dialogue state tracking methods for visually grounded question generation, under both limited and unlimited visual information, and for dialogue response generation. We obtain strong experimental results on the GuessWhat?!, VisDial, and GuessWhich tasks.
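The state-update idea above, in which each new question-answer pair refines the agent's belief about the target object, can be sketched as a toy belief filter. This is a minimal illustrative sketch only: the thesis model fuses learned visual and linguistic features, whereas here each question-answer pair is reduced to a hand-written likelihood score per candidate object, and all names and numbers are hypothetical.

```python
def update_beliefs(beliefs, qa_likelihoods):
    """One tracking step: reweight each candidate object by how well it
    explains the latest question-answer pair, then renormalize so the
    state remains a probability distribution over candidates."""
    posterior = {obj: p * qa_likelihoods.get(obj, 1e-9)
                 for obj, p in beliefs.items()}
    total = sum(posterior.values())
    return {obj: p / total for obj, p in posterior.items()}


# Start from a uniform guessing state over three candidate objects.
state = {"cat": 1 / 3, "dog": 1 / 3, "car": 1 / 3}

# Q: "Is it an animal that meows?"  A: "Yes"  ->  toy likelihoods.
state = update_beliefs(state, {"cat": 0.9, "dog": 0.1, "car": 0.05})

best_guess = max(state, key=state.get)  # "cat"
```

The key property this mirrors from the dissertation is that the dialogue state is the single object updated after every turn and used for the current decision, rather than re-encoding the whole history with a fixed image feature.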
Keywords/Search Tags: Visual Dialogue, Questioning State, Guessing State, Visual Dialogue State Tracking, Multi-Modal Dialogue, Vision and Language