With the continuous accumulation of cross-media data such as images, text, video, and audio, and with the ongoing development of deep representation learning, information processing and analysis is gradually shifting from a single modality to multiple modalities. Multimodal tasks, represented by visual question answering (VQA), have therefore received extensive attention from researchers. The VQA task is designed to test a machine's ability to understand images and questions: given a picture and a natural language question, the machine must answer the question. Although recent research has greatly advanced VQA, two problems still severely limit its practical application. First, models are overly affected by language bias and generalize poorly; second, many methods based on pre-trained models incur high data annotation and transfer costs and have poor generality. To address these issues, this thesis builds on cross-modal representation learning, designs new learning strategies that exploit language bias effectively, and, combined with prompt learning, proposes a more general visual question answering model. The main contents of this master's thesis are as follows:

1. Existing methods focus on reducing the impact of language bias on the model, which also weakens the model's ability to learn context priors. To address this deficiency, this thesis proposes the CCB learning strategy. The CCB learning strategy first builds a content branch and a context branch, and then uses the language bias to construct a joint loss function that optimizes the two branches together with the final answer prediction (sketched below). Specifically, the content branch focuses on local key information and re-weights samples through the language bias to reduce the influence of statistical priors on the model. The context branch attends to globally effective information and constructs context labels from the language bias so that the model retains its ability to learn context priors. Finally, the model fuses the predictions of the two branches to obtain the final answer.

2. As pre-trained models grow in complexity, the hardware requirements and training cost of fine-tuning keep increasing. In response, this thesis proposes the OCAP model, which introduces the "Pre-Train, Prompt, Predict" research paradigm into the visual question answering task. The OCAP model makes full use of CLIP's ability to encode visual and textual data and adaptively learns the prompt construction by optimizing the context vectors of the answers, completing the task as answer matching (see the second sketch below).

3. The proposed methods are verified on two public datasets, VQA v2.0 and VQA-CP v2.0. The experimental results show that the CCB learning strategy effectively improves model performance on VQA-CP v2.0 and significantly reduces the performance gap between the two datasets. The OCAP model also achieves competitive performance without training on additional multimodal data.
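
The abstract describes the CCB joint loss only at a high level. As a rough illustration, the PyTorch sketch below shows one plausible way the content and context branches could be optimized together with the fused prediction; the function name `ccb_joint_loss`, the (1 - prior confidence) sample re-weighting, the soft context-label construction, and the fusion weight `alpha` are assumptions made for illustration, not the exact formulation used in the thesis.

```python
import torch
import torch.nn.functional as F

def ccb_joint_loss(content_logits, context_logits, prior, target, alpha=0.5):
    """Illustrative joint loss for a content/context two-branch setup.

    content_logits, context_logits: [B, A] branch scores over answers
    prior : [B, A] answer distribution implied by the question alone (language bias)
    target: [B]    ground-truth answer indices
    """
    # Content branch: down-weight samples that the language prior already
    # answers correctly, so statistical priors influence the branch less.
    prior_conf = prior.gather(1, target.unsqueeze(1)).squeeze(1)          # [B]
    sample_w = 1.0 - prior_conf
    content_loss = (sample_w * F.cross_entropy(content_logits, target,
                                               reduction="none")).mean()

    # Context branch: soft "context labels" mix the prior with the ground
    # truth, so the branch keeps the useful part of the context prior.
    context_labels = F.normalize(
        prior + F.one_hot(target, prior.size(1)).float(), p=1, dim=1)
    context_loss = F.kl_div(F.log_softmax(context_logits, dim=1),
                            context_labels, reduction="batchmean")

    # The fused prediction is what the model finally answers with.
    fused_logits = alpha * content_logits + (1.0 - alpha) * context_logits
    fused_loss = F.cross_entropy(fused_logits, target)

    return content_loss + context_loss + fused_loss, fused_logits
```

A training loop would minimize the returned loss, while at test time only `fused_logits` is used to select the answer.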
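
Likewise, the second sketch illustrates the answer-side prompt tuning described for OCAP: only the learnable context vectors are updated, while the CLIP encoders stay frozen. The module name `AnswerPromptMatcher`, the tensor shapes, and the frozen `text_encoder` callable are placeholders assumed for illustration; the thesis's actual prompt construction may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPromptMatcher(nn.Module):
    """Illustrative prompt-tuning head: only the answer-side context
    vectors are trained; the CLIP encoders are kept frozen elsewhere."""

    def __init__(self, n_ctx=4, dim=512):
        super().__init__()
        # Learnable context vectors shared across answers ("Prompt" construction).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.logit_scale = nn.Parameter(torch.tensor(4.6))  # ~ln(100), as in CLIP

    def forward(self, fused_feat, answer_emb, text_encoder):
        # fused_feat  : [B, dim]     frozen CLIP image(+question) feature
        # answer_emb  : [A, L, dim]  frozen token embeddings of each answer
        # text_encoder: frozen callable mapping [A, n_ctx+L, dim] -> [A, dim]
        prompts = torch.cat(
            [self.ctx.unsqueeze(0).expand(answer_emb.size(0), -1, -1), answer_emb],
            dim=1)
        ans_feat = text_encoder(prompts)                      # [A, dim]
        fused_feat = F.normalize(fused_feat, dim=-1)
        ans_feat = F.normalize(ans_feat, dim=-1)
        # Answer matching: similarity between the fused query and every answer prompt.
        return self.logit_scale.exp() * fused_feat @ ans_feat.t()   # [B, A] logits
```

Training then reduces to a cross-entropy over these answer-matching logits, with gradients flowing only into the context vectors and the logit scale.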