| With the continuous development of artificial intelligence, natural language processing and computer vision, two of its core technical fields, have made remarkable progress. Visual and text data in many domains are growing explosively. How to make visual data (images, videos) and text data interact effectively, and how to extract, filter, and infer useful information from them, is an important challenge in artificial intelligence. In response, researchers have proposed many cross-modal tasks, such as image captioning, visual question answering, and visual dialogue. Among them, the visual dialogue task aims to accurately answer a sequence of questions about visual content. The key to the task is to understand the semantics of the question, locate the correct target in the image, and then infer the correct answer. However, the dialogue history in visual dialogue contains a large number of pronouns, and a dialogue model may fail to determine the target entity a pronoun refers to, which leads to biased answers. To address this problem of unclear reference, this thesis carries out the following research: 1. To address the ubiquitous visual reference resolution problem in visual dialogue, a visual dialogue model based on double soft constraints is proposed. 2. Based on the linguistic knowledge that the antecedent of a pronoun can only be a noun or noun phrase, the first soft constraint is proposed, which introduces learnable part-of-speech tags and a part-of-speech tag prediction loss. 3. Based on the observation that a pronoun in dialogue usually refers to something mentioned in nearby dialogue rounds, a second soft constraint is proposed that encodes sentence positions with sinusoidal position encoding, aiming to enhance local interaction between sentences. To verify the effectiveness of the proposed visual dialogue approach based on double soft constraints, extensive experiments were conducted with the model on three datasets, VisDial v0.9, VisDial v1.0, and GuessWhat?!, including quantitative experiments, qualitative experiments, and ablation experiments. The experimental results show that the method based on double soft constraints achieves better results than previous methods: it effectively resolves the entities referred to by pronouns and improves the answer accuracy of visual dialogue models. |