Visual Dialog research has made significant progress in recent years by introducing various vision-oriented goals into conversations. However, many challenges remain, including the lack of diversity in questioning strategies and limited attention to dialog-level tasks. To address these issues, this paper first proposes an entity-enhanced question generation model consisting of two components, the Related entity-enhanced Questioner (ReeQ) and the Augmented Guesser (AugG), that work together to improve the quality and effectiveness of the visual dialog (VD) questioner. ReeQ generates questions guided by related entities and learns entity-based questioning strategies from human dialogs, while AugG is optimized for the VD setting and aims to make accurate image predictions based on the generated dialogs. To further explore this scenario, this paper proposes a new visual dialog task called Dial-the-Diff, in which two interlocutors, each given one of two similar images, attempt to discover the differences between them through natural-language dialog. The task is designed to investigate difference-oriented questioning strategies and the ability to categorize objects in a scene during visual dialog. The author constructs a large-scale multimodal dataset called DialDiff for this task, comprising 87k virtual-reality images and 78k dialogs, and highlights the challenges behind the task. The author evaluates the proposed entity-enhanced question generation model on the VisDial v1.0 dataset and achieves state-of-the-art performance on both image guessing and question diversity. Specifically, the proposed method outperforms previous approaches in generating more visually relevant, informative, and coherent questions. Further human evaluation confirms the effectiveness of the proposed model. For the newly proposed Dial-the-Diff task, the author proposes a benchmark model and conducts extensive experiments to evaluate its performance and identify remaining challenges. This work contributes to the development of visual dialog research and provides new methods and datasets to address key challenges in the field.
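The interplay between the two components can be pictured as a simple self-play image-guessing loop: the questioner asks entity-guided questions, an answerer responds about the hidden target image, and the guesser ranks candidate images from the accumulated dialog. The sketch below is illustrative only; ReeQ and AugG are named in the paper, but every interface here (`select_entities`, `generate_question`, `answer`, `score_images`) is a hypothetical placeholder, not the author's actual API.

```python
# Minimal sketch of a ReeQ/AugG image-guessing round. All method names are
# hypothetical placeholders standing in for the paper's components.
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class DialogState:
    history: List[str] = field(default_factory=list)  # alternating Q/A turns


def run_dialog(reeq, augg, oracle, candidate_images: Sequence, num_rounds: int = 10) -> int:
    """Play one game: ReeQ asks questions, the oracle (answerer, which sees
    the target image) answers, and AugG guesses the target from the dialog."""
    state = DialogState()
    for _ in range(num_rounds):
        # ReeQ conditions on related entities to diversify its questioning strategy.
        entities = reeq.select_entities(state.history)
        question = reeq.generate_question(state.history, entities)
        answer = oracle.answer(question)
        state.history += [question, answer]
    # AugG scores every candidate image against the full generated dialog.
    scores = augg.score_images(candidate_images, state.history)
    return max(range(len(scores)), key=scores.__getitem__)  # index of best guess
```

In training, a reward from AugG's guessing accuracy could be fed back to ReeQ, which is one plausible reading of "optimized for visual dialog" above; the paper's exact objective is not reproduced here.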