With rapid advances in natural language processing and computer vision, the fusion of these two fields into multi-modal tasks such as visual question answering, visual dialogue, and image captioning has attracted significant attention. For tasks such as image captioning and visual question answering, models aim to help humans understand the information in an image through a single interaction. In real-life scenarios, however, comprehending image content is a progressive process that often requires multiple interactions to grasp different aspects of the image. To enable machines to better simulate how humans understand images, researchers have proposed the visual dialogue task, which takes a multi-turn question-and-answer format. Visual dialogue requires the machine to fully comprehend the question posed, reason appropriately over the visual content and the contextual information in the ongoing dialogue, and respond in natural language with meaningful, coherent statements about the observed visual content. It is therefore a highly challenging cross-modal task, spanning research fields such as natural language processing, image processing, and machine learning, and it holds both scientific value and promising practical applications. Although the task has seen great progress in multi-modal information fusion and reasoning, mainstream models remain limited in answering questions that involve fine-grained semantic properties and spatial relationships: there is no bridge between visual feature representations and textual semantics such as the dialogue history and the current question, so a semantic gap persists. To address this, this paper proposes a visual dialogue algorithm based on a visual-semantic dual-encoding, multi-channel inference model. The algorithm fully mines and extends the representation of the image's visual content, explicitly providing a set of fine-grained semantic descriptions of that content, and constructs three multi-modal information channels from the visual features, the semantic descriptions, and the dialogue history; interaction between the channels, combined with multi-step reasoning, enriches the semantic representation of the question. In addition, to obtain more coherent question-and-answer responses, this paper introduces the semantic information as a knowledge base that participates in decoding and adopts a multi-modal decoder to achieve more accurate answer generation. The proposed algorithm is compared with mainstream algorithms for visual dialogue generation on the large-scale public datasets VisDial v0.9 and VisDial v1.0. Experimental results show that the model using multi-channel, multi-step inference achieves superior performance on the main evaluation metrics, including Mean Reciprocal Rank (MRR), recall (R@k), and the mean rank of the ground-truth answer (Mean). Manual evaluation further shows that the dialogue generated by the proposed algorithm improves on relevance to the question, accuracy, and sentence coherence.
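To make the three-channel, multi-step reasoning idea concrete, the sketch below shows one plausible way to iteratively refine a question representation by attending to visual, semantic-description, and dialogue-history channels. This is a minimal illustration, not the authors' exact architecture: all module names, dimensions, the attention form, and the fusion scheme are assumptions introduced here for clarity.

```python
# Illustrative sketch of multi-channel, multi-step question refinement.
# All names, dimensions, and the fusion scheme are assumptions, not the
# paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Attend over one channel's features, conditioned on the question state."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.key_proj = nn.Linear(d_model, d_model)

    def forward(self, question: torch.Tensor, channel: torch.Tensor) -> torch.Tensor:
        # question: (B, D); channel: (B, N, D)
        q = self.query_proj(question).unsqueeze(1)             # (B, 1, D)
        k = self.key_proj(channel)                             # (B, N, D)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)                    # (B, 1, N)
        return torch.bmm(weights, channel).squeeze(1)          # (B, D)


class MultiChannelReasoner(nn.Module):
    """Refine the question over several reasoning steps by attending to the
    visual, semantic-description, and dialogue-history channels."""

    def __init__(self, d_model: int, n_steps: int = 2):
        super().__init__()
        self.n_steps = n_steps
        self.visual_att = ChannelAttention(d_model)
        self.semantic_att = ChannelAttention(d_model)
        self.history_att = ChannelAttention(d_model)
        self.fuse = nn.Linear(4 * d_model, d_model)

    def forward(self, question, visual, semantic, history):
        state = question                                       # (B, D)
        for _ in range(self.n_steps):                          # multi-step reasoning
            v = self.visual_att(state, visual)                 # visual channel
            s = self.semantic_att(state, semantic)             # semantic channel
            h = self.history_att(state, history)               # history channel
            # Enrich the question state with all three channel summaries.
            state = torch.tanh(self.fuse(torch.cat([state, v, s, h], dim=-1)))
        return state                                           # representation passed to the decoder


# Toy usage with random features (batch of 2, 64-dim embeddings).
if __name__ == "__main__":
    model = MultiChannelReasoner(d_model=64, n_steps=2)
    q = torch.randn(2, 64)          # encoded current question
    v = torch.randn(2, 36, 64)      # region-level visual features
    s = torch.randn(2, 10, 64)      # fine-grained semantic descriptions
    h = torch.randn(2, 5, 64)       # encoded dialogue-history turns
    print(model(q, v, s, h).shape)  # torch.Size([2, 64])
```

Under these assumptions, the refined state produced after the final reasoning step would serve as the query-side input to the multi-modal decoder, where the semantic descriptions also participate as a knowledge base during answer generation.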