In recent years, multimodal interaction tasks based on computer vision and natural language processing have made significant progress. However, challenging problems remain, such as class imbalance and modality alignment. Visual question answering (VQA) is a classic multimodal interaction task that takes an image and a question and produces a text answer related to the image content. However, VQA suffers from a language prior problem: models tend to make naive decisions based directly on the co-occurrence patterns between textual questions and answers, without fully considering the visual evidence in the images, and consequently generalize poorly. The distribution of candidate answers in the relevant datasets is also unbalanced. Accordingly, this paper proposes two methods to deal with the language prior.

First, in VQA tasks the model depends too heavily on the language prior and pays too little attention to the visual content. Some previous methods for enhancing visual sensitivity only help a question locate the correct region without increasing the importance of the image in answering the question, and most recent methods for mitigating the language prior focus only on the relationship between the base model and a question-only model. To overcome the language prior problem in VQA, this paper proposes a method that further strengthens the visual content so as to increase the influence of the image on the answer. The method consists of three parts: a base-model branch, a question-only branch, and a visual branch. Experiments show that this method improves the accuracy of five classical models to different degrees on the VQA-CP v1, VQA-CP v2, and VQA v2 datasets, which demonstrates its effectiveness and provides a new direction for alleviating the language prior problem. In addition, this paper analyzes the influence of the cosine similarity loss and the contrastive loss used in the method.

Second, to avoid the influence of the 'inversion' structure of the dataset and to further balance its distribution, this paper improves the way the bias is obtained and proposes a shuffling bias (LMS) model. Specifically, only the bias caused by the question type in the training set is shuffled. We combine it with a self-supervised model in a two-stage learning scheme to further improve the model's ability to overcome the language prior. Moreover, while handling the language prior, some methods that add a question-only model prevent the model from learning positive biases that are useful for generalization; this paper therefore further analyzes the limitations and applications of previous uses of question-only bias. Experiments show an accuracy of 60.75% on the VQA-CP v2 dataset, and the accuracy on the VQA v2 dataset reaches 65.59%, which also surpasses other methods that address the language prior. Furthermore, with training parameters of the same scale, the proposed two-stage training method with shuffling bias reaches the highest level on the VQA-CP v2 dataset.
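The three-branch idea described above can be illustrated with a minimal sketch. The abstract does not specify how the branches are combined, so the fusion below is an assumption for illustration only: the question-only (bias) branch is subtracted from the base branch to suppress the language prior, and the visual branch is added to strengthen the image's influence. All function and parameter names (`debiased_scores`, `alpha`, `beta`) are hypothetical, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def debiased_scores(base_logits, qonly_logits, visual_logits,
                    alpha=1.0, beta=1.0):
    """Hypothetical fusion of the three branches: subtract the
    question-only (bias) branch and add the visual branch so the
    image carries more weight in the final answer distribution."""
    return base_logits - alpha * qonly_logits + beta * visual_logits

# Toy example: the base model alone would pick answer 0 (driven by
# the question-only bias); the fused scores shift toward answer 2,
# which the visual branch supports.
base   = np.array([2.0, 1.0, 0.0])
qonly  = np.array([3.0, 0.0, 0.0])
visual = np.array([0.0, 0.5, 2.0])
fused  = debiased_scores(base, qonly, visual)
probs  = softmax(fused)
```

Here the subtraction removes score mass that is explained by the question alone, which is the common intuition behind question-only debiasing branches.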
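The shuffling-bias step can likewise be sketched. The abstract only states that the bias caused by the question type is shuffled within the training set, so the following is a minimal illustrative sketch under that reading: per-example bias vectors are permuted among examples sharing a question type, leaving the examples themselves untouched. The data layout (`qtype`, `bias` keys) and the function name are assumptions.

```python
import random
from collections import defaultdict

def shuffle_bias_within_question_type(examples, seed=0):
    """Hypothetical sketch: permute the stored bias (answer-prior)
    values among training examples of the same question type, so
    each example keeps a bias drawn from its own question type but
    no longer its own."""
    rng = random.Random(seed)
    # Group example indices by question type.
    groups = defaultdict(list)
    for i, ex in enumerate(examples):
        groups[ex["qtype"]].append(i)
    shuffled = [dict(ex) for ex in examples]
    # Shuffle biases only within each question-type group.
    for idxs in groups.values():
        biases = [examples[i]["bias"] for i in idxs]
        rng.shuffle(biases)
        for i, b in zip(idxs, biases):
            shuffled[i]["bias"] = b
    return shuffled

data = [
    {"qtype": "what color", "bias": 0.9},
    {"qtype": "what color", "bias": 0.1},
    {"qtype": "is there",   "bias": 0.7},
]
out = shuffle_bias_within_question_type(data)
```

Because the permutation is confined to each question type, the marginal bias distribution per question type is preserved while the per-example association between question and bias is broken.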