
Research On Language Bias Of Visual Question Answering Model

Posted on: 2024-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z R Zeng
Full Text: PDF
GTID: 2568306914461774
Subject: Communication Engineering (including broadband network, mobile communication, etc.) (Professional Degree)
Abstract/Summary:
Visual Question Answering (VQA) is a multi-modal task that takes an image and a text question as input and infers an answer from textual and visual cues. It requires the model to perceive both visual features and natural-language features and to reason jointly over them. Current work shows that VQA models tend to over-rely on the language bias in the dataset when inferring answers, while ignoring visual cues from the image; as a result, they generalize poorly on out-of-distribution data such as VQA-CP, a dataset designed to test the generalization ability of VQA models. This thesis therefore focuses on the problem of VQA models relying too heavily on language bias, with innovations in three respects: reducing language bias, correcting model bias, and guiding the model to attend to important image regions. The specific contributions are as follows:

(1) One cause of language bias is high-frequency co-occurrence between question words and particular answer labels in the training set. A model may latch onto this superficial statistical correlation, which is not an intrinsic property of the samples and is therefore harmful to generalization. Since such co-occurrence is one source of language bias, reducing it can alleviate the bias. This thesis uses paraphrase generation to obtain questions with the same semantics but different surface forms, thereby lowering the frequency of lexical co-occurrence. In addition, to avoid changing the distribution of the dataset, the paraphrases are not added to the training set as new samples; they are used only to optimize the hidden-layer representation of the question text through contrastive learning.

(2) The existence of language bias inevitably leads the model into shortcut learning. Therefore, in addition to reducing the language bias of the dataset, the model itself can be optimized by correcting its shortcut-learning behavior. Specifically, this thesis uses a "question-only" branch, whose input is the question text alone, to identify shortcut learning: if this branch still infers the correct answer without seeing the image, the prediction must rest on shortcut learning from textual priors. After capturing the model's bias in this way, the captured bias is used to correct the model, achieving unbiased prediction.

(3) To guide the model to predict answers from visual cues, this thesis introduces a confusion loss and a confidence loss. The confusion loss first masks the key regions of the image and then drives the model's prediction on the masked sample toward confusion; the confidence loss strengthens the model's confidence in its prediction when the image is unmasked and the model predicts the answer correctly.

Building on the above research, this thesis implements a VQA system: the client submits a text question and the corresponding image, the server calls the VQA model to predict the answer, and the generated answer is returned to the client interface.
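Contribution (1) optimizes the question's hidden representation with contrastive learning over paraphrase pairs. A minimal sketch of one such contrastive objective (InfoNCE) is shown below, assuming a batch of question embeddings `q` and their paraphrase embeddings `p`; the function name, temperature value, and NumPy formulation are illustrative, not the thesis's actual implementation:

```python
import numpy as np

def info_nce(q, p, temperature=0.1):
    """InfoNCE contrastive loss: each question embedding q[i] should be
    closest to its own paraphrase p[i] among all paraphrases in the batch."""
    # L2-normalise so dot products are cosine similarities
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = q @ p.T / temperature               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal: paraphrase i matches question i
    return -np.mean(np.diag(log_prob))
```

Pulling a question toward its paraphrase and away from other questions in the batch encourages the encoder to represent semantics rather than surface word patterns, which is how the co-occurrence statistics are weakened without adding new training samples.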
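Contribution (2) uses a question-only branch to capture and then remove language bias. A well-known pattern of this kind is RUBi-style logit masking, sketched below as an illustrative stand-in (the thesis's exact correction formula is not given in the abstract): during training, the fused vision-and-language logits are scaled by the sigmoid of the question-only logits, so answers predictable from text alone absorb the bias, and at inference only the fused logits are used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rubi_style_mask(fused_logits, q_only_logits):
    """Training-time debiasing: the question-only branch's sigmoid acts
    as a per-answer mask on the fused (vision + language) logits, so the
    text-predictable component of the answer absorbs the language bias.
    At test time, `fused_logits` are used unmasked."""
    return fused_logits * sigmoid(q_only_logits)
```

The key design choice is that gradients through the masked logits discourage the main model from encoding what the question-only branch already explains, leaving it to rely on visual evidence instead.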
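Contribution (3) pairs a confusion loss on region-masked images with a confidence loss on unmasked ones. A minimal sketch under simple assumptions: confusion is encouraged by pushing the masked-image prediction toward the uniform distribution (entropy maximisation), and confidence by sharpening the correct answer's probability (written here as plain cross-entropy). Both formulations are illustrative, not the thesis's exact losses.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def confusion_loss(masked_logits):
    """On images whose key regions are masked, push the answer
    distribution toward uniform: minimise the negative entropy."""
    p = softmax(masked_logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return -entropy.mean()

def confidence_loss(logits, labels):
    """On unmasked images, strengthen confidence in the correct answer
    (cross-entropy as an illustrative stand-in)."""
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
```

Together the two terms make a correct answer on the intact image "cheap" and a confident answer on the region-masked image "expensive", which pressures the model to ground its prediction in the masked-out visual regions.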
Keywords/Search Tags:visual question answering, language bias, contrastive learning