With the rapid development of computer hardware and computing power, artificial intelligence is widely used in many fields. Multimodal tasks, which require combining information of different forms, are currently a hot topic in artificial intelligence research. However, because of the heterogeneity gap between modalities, machines cannot fully exploit and fuse information from different modalities, so adequately representing multimodal information remains a key challenge in current research. Visual question answering, an important multimodal sub-task, requires answering questions posed about an image. The problem of inadequate multimodal representation also appears in this task: answers correlate strongly with questions but only superficially with images, so multimodal information is not fully exploited for answer prediction, leading to poor robustness and incorrect answers in scenarios outside the training dataset. Expanding the dataset is labor-intensive and time-consuming, and data privacy in specialized domains makes such expansion difficult. To address this challenge, this paper studies the representation learning of multimodal information in visual question answering and improves it in two respects.

(1) To address the limited capacity of existing visual question answering models to learn mutual representations of the two modalities, a Global-Local Attention Network (GLAN) is proposed, motivated by the observation that such models cannot distinguish similar targets in an image and rely too heavily on textual information. To compensate for the lack of local information in global features, the model adds an attention module that captures fine-grained image features through a Mix attention mechanism, and convolution-guided attention is used to strengthen the image representation and increase the weight of image information in the model (a minimal sketch of this kind of convolution-guided attention is given at the end of this abstract). Experiments on the VQAv2 and GQA datasets show that GLAN improves accuracy by 0.65% and 0.41%, respectively, over the baseline model, demonstrating that it effectively enhances the model's reasoning ability.

(2) To address the linguistic bias caused by insufficient representation of multimodal features in existing visual question answering models, a multi-feature enhancement approach is proposed, in which one or more differently expressed features are fused with the original features; it is validated in two different models. The approach is first validated on the VQAv2 dataset with the MCAN model, where textual and visual features are fused with processed bimodal features to enhance the feature representation. It is then applied to the CFVQA model, yielding the optimized DCFVQA model, which incorporates discrete cosine transformed features into the counterfactual causal structure to refine the indirect relationship between the image and the question, better remove linguistic bias, and fuse the resulting features, as sketched below. In the preliminary validation, the optimization improves accuracy by 0.17% over the MCAN model, and DCFVQA improves by 0.85% over CFVQA, demonstrating the effectiveness of the proposed approach.
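As an illustration of the multi-feature enhancement idea, the sketch below applies a discrete cosine transform along the feature dimension and fuses the transformed features back into the original features with a learned gate. This is a minimal PyTorch sketch under assumed shapes; the class name DCTFeatureEnhancer, the gating scheme, and the residual fusion are illustrative assumptions rather than the exact DCFVQA design.

```python
import math
import torch
import torch.nn as nn

def dct_matrix(n: int) -> torch.Tensor:
    # Orthonormal DCT-II basis matrix; rows are frequency components.
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index (n, 1)
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # position index (1, n)
    basis = torch.cos(math.pi * (i + 0.5) * k / n) * math.sqrt(2.0 / n)
    basis[0] /= math.sqrt(2.0)                               # rescale the DC row
    return basis                                             # shape (n, n)

class DCTFeatureEnhancer(nn.Module):
    """Fuse DCT-transformed features back into the original features (illustrative sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.register_buffer("dct", dct_matrix(dim))         # fixed transform over the feature dim
        self.proj = nn.Linear(dim, dim)                      # map frequency features back
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) fused multimodal features
        freq = self.proj(x @ self.dct.T)                     # DCT along the last dimension
        g = self.gate(torch.cat([x, freq], dim=-1))          # learned mix between the two views
        return x + g * freq                                  # residual fusion with the original features

# Example: enhance a 7x7 grid of 512-dimensional joint features.
enhancer = DCTFeatureEnhancer(dim=512)
fused = torch.randn(8, 49, 512)
print(enhancer(fused).shape)                                 # torch.Size([8, 49, 512])
```

The gated residual keeps the original features intact while letting the model decide, per dimension, how much frequency-domain information to mix in.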
To summarize, this study explores multimodal representation learning and evaluates its effectiveness in the context of visual question answering. By incorporating fine-grained features and multi-feature enhancement, it aims to reduce the model's dependence on unimodal information and enrich the representation of multimodal information. The findings indicate that both methods improve the representation of multimodal information and effectively enhance the accuracy and robustness of the model.
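As a companion illustration for the GLAN idea referenced in (1), the sketch below biases the queries of self-attention over grid image features with a depthwise convolution that captures local, fine-grained structure. The class name, the depthwise-convolution choice, and the residual wiring are assumptions made for illustration; the actual GLAN modules may differ.

```python
import torch
import torch.nn as nn

class ConvGuidedAttention(nn.Module):
    """Self-attention over grid image features with a convolutional query bias (illustrative sketch)."""
    def __init__(self, dim: int, heads: int = 8, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Depthwise convolution captures local detail on the spatial feature grid.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, grid * grid, dim) flattened grid of image features
        b, n, d = x.shape
        grid_feats = x.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        local = self.local(grid_feats).flatten(2).transpose(1, 2)   # per-region local detail
        query = self.norm(x + local)                                 # convolution-guided queries
        out, _ = self.attn(query, x, x)
        return x + out                                               # residual update of the image features

# Example: 14x14 grid of 512-dimensional image features.
layer = ConvGuidedAttention(dim=512, heads=8, grid=14)
img = torch.randn(4, 14 * 14, 512)
print(layer(img).shape)                                              # torch.Size([4, 196, 512])
```

Guiding the queries with a local convolution is one simple way to increase the influence of fine-grained image information on the attention weights.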