
Visual Semantic Understanding Based Visual Dialogue

Posted on: 2022-06-02    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L Zhao    Full Text: PDF
GTID: 1488306728465514    Subject: Computer Science and Technology
Abstract/Summary:
Currently, with the rapid digitization and informatization of society, artificial intelligence (AI) has entered a period of vigorous development. Many basic tasks in computer vision (CV) and natural language processing (NLP) have seen substantial breakthroughs, such as object detection, image segmentation, text classification, and machine translation. However, with the explosion of visual and textual data, cross-modal vision-and-language tasks, such as cross-modal retrieval, visual captioning, and visual question answering (VQA), have attracted more and more attention. Visual question answering is a typical cross-modal task that answers the current question accurately according to the input image; visual dialogue additionally requires historical dialogue information to complete multiple rounds of consecutive question answering.

The core of visual dialogue is how to handle cross-modal data, namely visual information and textual information. The visual information is the input image, while the textual information comprises the image caption, the dialogue history, and the current question. The key to answering a question is to understand its semantics and reason over the visual semantic information to obtain the final answer. When questions are complex, for example when they contain pronouns or compound semantics, accurate reasoning is difficult, which leads to poor dialogue continuity and low answer accuracy. To address these problems, this dissertation proposes four visual dialogue algorithms based on visual semantic understanding. The main contents and contributions are as follows:

(1) For the common problem of visual reference resolution in visual dialogue, an adaptive visual-memory-based visual dialogue model is proposed, which applies an external memory bank to directly store grounded visual information. Both the textual and visual grounding processes are omitted, so the errors those two processes may introduce are effectively eliminated. When answering a question, the model does not need to search the dialogue history for the specific referent of a pronoun, but reads it directly from the visual memory bank. Moreover, in many cases the answer can be produced from the question and image alone, and the historical information only introduces unnecessary errors, so the external visual memory is read adaptively through a dynamically learned confidence. In addition, a residual queried image feature is fused with the attended memory to better cope with these different situations (a minimal sketch of this gated read follows).
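To make the adaptive read in (1) concrete, the following is a minimal PyTorch-style sketch of a confidence-gated read from an external visual memory bank, with residual fusion of the queried image feature. The module name, the dimensions, and the sigmoid gating form are illustrative assumptions, not the dissertation's exact architecture.

```python
# Minimal sketch (assumed architecture): attend over a visual memory bank
# with the current question, gate the read by a learned confidence, and
# fuse the result residually with the queried image feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveVisualMemoryRead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.key_proj = nn.Linear(dim, dim)  # project memory slots to attention keys
        self.gate = nn.Linear(dim, 1)        # learned confidence in using the memory

    def forward(self, question, image_feat, memory):
        # question:   (B, D)    encoding of the current question
        # image_feat: (B, D)    question-attended image feature
        # memory:     (B, M, D) visual memory bank of grounded region features
        keys = self.key_proj(memory)                      # (B, M, D)
        scores = torch.bmm(keys, question.unsqueeze(2))   # (B, M, 1)
        attn = F.softmax(scores, dim=1)                   # attention over memory slots
        read = (attn * memory).sum(dim=1)                 # (B, D) attended memory
        # Dynamically learned confidence: when the question and image alone
        # suffice, a low gate value suppresses the memory read.
        conf = torch.sigmoid(self.gate(question))         # (B, 1)
        fused = conf * read + image_feat                  # residual fusion
        return fused
```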
(2) For common complex scenarios, this dissertation proposes a visual dialogue pipeline based on structured external knowledge. First, commonsense knowledge derived from ConceptNet supplements the external common sense missing from the visual dialogue. Structured knowledge is constructed over the image and the caption to capture the semantic relevance among objects. To extract the relational context among the entities of the graph, a graph convolutional network (GCN) is applied to encode the knowledge graph. Experiments on a public dataset show that the structured external knowledge effectively enhances reasoning ability and improves response accuracy in a variety of scenarios.

(3) In view of the limited application scenarios of ordinary visual dialogue, this dissertation presents a goal-oriented visual dialogue method for the 'GuessWhich' task based on external knowledge. Beyond the conventional visual dialogue setting of only answering questions, the questions are also generated automatically, and at the end of each dialogue round the most similar images are retrieved from the image library according to the questions and answers. The external knowledge extracted for the image and the caption is used to enhance both the question-generation ability and the answer accuracy, which in turn improves image-guessing performance. The experimental results demonstrate the superiority of this method.

(4) To alleviate repetitive conversations in the goal-oriented 'GuessWhich' dialogue, a method with an attentive memory network is proposed. First, the memory network learns different weights for the historical question-answer pairs of different rounds with respect to the current round; the weighted history reduces repetition in the generated dialogues and makes image retrieval more efficient. Second, a novel attentive memory network is proposed that adds a fusion model to the memory network. The fusion model effectively exploits the caption information and the image feature; through this multivariate information fusion, the historical information is focused on the visual feature, so the generated dialogues and the predicted image representation are visually grounded (see the sketch after the summary below). Finally, the experimental results demonstrate the effectiveness of the method.

Finally, this dissertation briefly summarizes the above contents, discusses prospects for future research, and puts forward feasible research directions.
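As a concrete illustration of the attentive memory network in (4), the following minimal sketch attends over historical question-answer pairs with the current dialog state and fuses the attended history with the caption and image features. The module name, the dimensions, and the concatenation-based fusion are assumptions for illustration, not the dissertation's exact design.

```python
# Minimal sketch (assumed architecture): weight past question-answer pairs
# by their relevance to the current round, then fuse the attended history
# with caption and image features to keep generation visually grounded.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveMemoryFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, dim)     # projection for attention scoring
        self.fuse = nn.Linear(3 * dim, dim)  # fuse history + caption + image

    def forward(self, state, history, caption, image):
        # state:   (B, D)    encoding of the current dialog round
        # history: (B, T, D) encodings of the T past question-answer pairs
        # caption: (B, D)    caption encoding
        # image:   (B, D)    image feature (predicted image representation)
        logits = torch.bmm(self.score(history), state.unsqueeze(2))  # (B, T, 1)
        weights = F.softmax(logits, dim=1)         # per-round importance of history
        attended = (weights * history).sum(dim=1)  # (B, D) weighted history
        # Multivariate fusion of attended history, caption, and image feature.
        fused = torch.tanh(self.fuse(torch.cat([attended, caption, image], dim=-1)))
        return fused, weights.squeeze(-1)          # fused state, history weights
```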
Keywords/Search Tags:Visual dialog, goal-oriented visual dialog, semantic features, memory network, attention mechanism, external knowledge, graph convolutional network