With rapid advances in natural language processing and computer vision, the fusion of these two fields into multi-modal tasks such as visual question answering, visual dialogue, and image captioning has attracted significant attention. For tasks such as image captioning and visual question answering, models aim to help humans understand the information in an image through a single interaction. In real-life scenarios, however, comprehending image content is a progressive process that often requires multiple interactions to grasp different aspects of the image. To enable machines to better simulate how humans understand images, researchers have proposed the visual dialogue task, which takes a multi-turn question-and-answer format. Visual dialogue requires the machine to fully comprehend the question posed, reason appropriately over the visual content and the contextual information in the ongoing dialogue, and respond in natural language with meaningful, coherent statements about the observed visual content. It is therefore a highly challenging cross-modal task, spanning research fields such as natural language processing, image processing, and machine learning, and it holds both scientific value and promising practical applications. Although the task has seen great progress in multi-modal information fusion and reasoning, mainstream models remain limited in answering questions that involve fine-grained semantic properties and spatial relationships: there is no bridge between visual feature representations and textual semantics such as the dialogue history and the current question, so a semantic gap persists. To address this, this paper proposes a visual dialogue algorithm based on a visual-semantic dual-encoding, multi-channel inference model. The algorithm fully mines and extends the representation of the image's visual content, explicitly providing a set of fine-grained semantic descriptions of that content, and constructs three multi-modal information channels from the visual features, the semantic descriptions, and the dialogue history; interaction between the channels, combined with multi-step reasoning, enriches the semantic representation of the question. In addition, to obtain more coherent question-and-answer responses, this paper introduces the semantic information as a knowledge base that participates in decoding and adopts a multi-modal decoder to achieve more accurate answer generation. The proposed algorithm is compared with mainstream algorithms for visual dialogue generation on the large-scale public datasets VisDial v0.9 and VisDial v1.0. Experimental results show that the model using multi-channel, multi-step inference achieves superior performance on the main evaluation metrics, including Mean Reciprocal Rank (MRR), recall (R@k), and the mean rank of the ground-truth answer (Mean). Manual evaluation further shows that the dialogue generated by the proposed algorithm improves on relevance to the question, accuracy, and sentence coherence.
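To make the three-channel, multi-step reasoning idea concrete, the sketch below shows one plausible way to iteratively refine a question representation by attending to visual, semantic-description, and dialogue-history channels. This is a minimal illustration, not the authors' exact architecture: all module names, dimensions, the attention form, and the fusion scheme are assumptions introduced here for clarity.

```python
# Illustrative sketch of multi-channel, multi-step question refinement.
# All names, dimensions, and the fusion scheme are assumptions, not the
# paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Attend over one channel's features, conditioned on the question state."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.key_proj = nn.Linear(d_model, d_model)

    def forward(self, question: torch.Tensor, channel: torch.Tensor) -> torch.Tensor:
        # question: (B, D); channel: (B, N, D)
        q = self.query_proj(question).unsqueeze(1)             # (B, 1, D)
        k = self.key_proj(channel)                             # (B, N, D)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)                    # (B, 1, N)
        return torch.bmm(weights, channel).squeeze(1)          # (B, D)


class MultiChannelReasoner(nn.Module):
    """Refine the question over several reasoning steps by attending to the
    visual, semantic-description, and dialogue-history channels."""

    def __init__(self, d_model: int, n_steps: int = 2):
        super().__init__()
        self.n_steps = n_steps
        self.visual_att = ChannelAttention(d_model)
        self.semantic_att = ChannelAttention(d_model)
        self.history_att = ChannelAttention(d_model)
        self.fuse = nn.Linear(4 * d_model, d_model)

    def forward(self, question, visual, semantic, history):
        state = question                                       # (B, D)
        for _ in range(self.n_steps):                          # multi-step reasoning
            v = self.visual_att(state, visual)                 # visual channel
            s = self.semantic_att(state, semantic)             # semantic channel
            h = self.history_att(state, history)               # history channel
            # Enrich the question state with all three channel summaries.
            state = torch.tanh(self.fuse(torch.cat([state, v, s, h], dim=-1)))
        return state                                           # representation passed to the decoder


# Toy usage with random features (batch of 2, 64-dim embeddings).
if __name__ == "__main__":
    model = MultiChannelReasoner(d_model=64, n_steps=2)
    q = torch.randn(2, 64)          # encoded current question
    v = torch.randn(2, 36, 64)      # region-level visual features
    s = torch.randn(2, 10, 64)      # fine-grained semantic descriptions
    h = torch.randn(2, 5, 64)       # encoded dialogue-history turns
    print(model(q, v, s, h).shape)  # torch.Size([2, 64])
```

Under these assumptions, the refined state produced after the final reasoning step would serve as the query-side input to the multi-modal decoder, where the semantic descriptions also participate as a knowledge base during answer generation.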