
Research On Language Ambiguity Elimination Methods In Visual Question Answering (VQA)

Posted on: 2021-01-09    Degree: Master    Type: Thesis
Country: China    Candidate: W Deng    Full Text: PDF
GTID: 2438330626964358    Subject: Electronic and communication engineering
Abstract/Summary:
With the development of multimedia and the Internet, how to process massive amounts of image and text information has become an urgent problem. Research at the intersection of computer vision and natural language processing has therefore become a focus for scholars, and Visual Question Answering (VQA) is one of its hot topics. The VQA task requires a machine, given a question and an image, to answer the question based on its understanding of the image. VQA involves technologies such as semantic understanding, image detection and recognition, and knowledge reasoning; it requires machines to understand images the way humans do and to interact with users in natural language. It is therefore important for improving the intelligence of artificial-intelligence systems such as robots. VQA has received extensive attention in recent years, and a large body of related work has emerged.

Generally speaking, a VQA model must process the visual information of the image and the textual information of the question simultaneously, mapping the extracted visual and text features into the same high-dimensional space through feature fusion. This requires the model to parse the semantics of the question correctly so that it can combine them with the visual features to give the correct answer.

For complex questions, language ambiguity often biases how existing models capture the text information, making it difficult for existing VQA systems to grasp the true meaning of the question. When an answer is wrong, humans can try to understand the question in other ways to obtain different answers. Inspired by this, this paper proposes a VQA method based on yes/no feedback. The method first uses the yes/no feedback mechanism to determine whether the initial answer is right or wrong. When the user's feedback is "no", the model re-analyzes the question, generates new disambiguated questions, and produces different candidate answers, then outputs the highest-confidence answer as the final result. We compare our method with existing methods on two benchmark datasets, CLEVR and CLEVR-CoGenT. On CLEVR, the accuracy of our method approaches 100%; on CLEVR-CoGenT, its accuracy is 21% higher than that of existing methods.
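The feature-fusion step described above (mapping visual and text features into the same high-dimensional space) can be sketched as follows. This is a generic illustration, not the thesis's actual model: the projection matrices `W_v`, `W_q`, the dimensions, and the element-wise-product fusion are all assumed stand-ins for whatever fusion scheme the thesis uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example dimensions: 2048-d image feature (e.g. a CNN output),
# 300-d question feature (e.g. an RNN state), fused into a 512-d joint space.
d_v, d_q, d = 2048, 300, 512
W_v = rng.standard_normal((d, d_v)) * 0.01   # hypothetical learned projections
W_q = rng.standard_normal((d, d_q)) * 0.01

v = rng.standard_normal(d_v)   # visual feature vector
q = rng.standard_normal(d_q)   # text feature vector

# Project both modalities into the same d-dimensional space, then combine
# them by element-wise product -- one common VQA fusion scheme.
fused = np.tanh(W_v @ v) * np.tanh(W_q @ q)
print(fused.shape)   # (512,)
```

The fused vector would then feed an answer classifier; other fusion choices (concatenation, bilinear pooling) drop in at the same point.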
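The yes/no feedback loop described above can be sketched in pseudocode-like Python. Everything here is a hypothetical stand-in: `answer_question`, `rephrase_question`, the toy question/answer table, and the confidence scores are illustrative only and are not from the thesis; the sketch only mirrors the control flow (first answer, yes/no feedback, disambiguation, highest-confidence candidate).

```python
def answer_question(question, image):
    """Hypothetical VQA model: returns (answer, confidence)."""
    toy_model = {
        "What color is the cube?": ("red", 0.4),
        "What color is the large cube?": ("blue", 0.9),
        "What color is the small cube?": ("red", 0.7),
    }
    return toy_model.get(question, ("unknown", 0.0))

def rephrase_question(question):
    """Hypothetical disambiguation step: produce less ambiguous variants."""
    return [
        question.replace("the cube", "the large cube"),
        question.replace("the cube", "the small cube"),
    ]

def vqa_with_feedback(question, image, user_says_yes):
    # First pass: answer the question as asked.
    answer, _ = answer_question(question, image)
    if user_says_yes(answer):
        return answer
    # Feedback was "no": re-analyze the question, generate disambiguated
    # variants, answer each, and return the highest-confidence candidate.
    candidates = [answer_question(q, image) for q in rephrase_question(question)]
    return max(candidates, key=lambda c: c[1])[0]

result = vqa_with_feedback("What color is the cube?", None,
                           user_says_yes=lambda a: a == "blue")
print(result)   # "blue": the highest-confidence disambiguated answer
```

If the user accepts the first answer, the loop terminates immediately; disambiguation is invoked only on a "no".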
Keywords/Search Tags:Visual question answering, Computer vision, Natural language processing, Syntactic disambiguation, Feedback