
Research On Visual Question Answering Technology Based On Multimodal Information Alignment

Posted on: 2024-06-26    Degree: Master    Type: Thesis
Country: China    Candidate: Q H Xia    Full Text: PDF
GTID: 2568307067493724    Subject: Communication and Information System
Abstract/Summary:
Visual Question Answering (VQA) requires a computer to answer questions about the content of a given image. It has broad application prospects in fields such as medical imaging, question answering for visually impaired users, and human-computer interaction, and it has therefore received widespread attention from scholars at home and abroad in recent years. How to establish an accurate mapping between visual features and text features, and how to further fuse and reason over the two modalities, are key issues in VQA. This thesis focuses on the alignment, fusion, and inference of multimodal information and conducts the following research:

(1) A visual question answering algorithm based on position alignment is proposed to address the inability of VQA models to infer the positional relationships between objects in an image from the question content. The position information of each object is mapped to a higher-dimensional space to generate an image position encoding (sketched in code after the abstract), and different position encoding schemes are designed and compared. Intra-modal and inter-modal attention networks then establish effective semantic associations between the position-encoded image features and the question features. Experiments on the VQA-v1, VQA-v2, and COCO-QA datasets verify the effectiveness of the proposed model, with overall accuracies of 68.7%, 66.9%, and 69.6%, respectively.

(2) A visual question answering method based on an Aggregated Multi-hop Attention Network (AMAN) is proposed to address the insufficient aggregation of surrounding information when VQA models connect text and images through the attention mechanism. Image objects and question words are treated as nodes of a graph neural network. When the attention weight between two nodes is computed, information from surrounding nodes relevant to both is added to the computation, and attention diffusion is achieved by computing a higher-order adjacency matrix (see the sketch after the abstract); stacking multiple attention layers yields a hierarchical reasoning process. Single-stream and dual-stream variants are then designed for performance comparison. The overall accuracies on the VQA-v1, VQA-v2, and COCO-QA datasets reach 69.37%, 69.87%, and 70.62%, respectively, verifying the effectiveness of the proposed method.

(3) To further strengthen the semantic association between image features and text features, a multimodal pre-training framework based on AMAN is proposed. The model is first pre-trained on a large image-text dataset and then transferred to the visual question answering task. For feature extraction, a Transformer structure replaces the traditional object detection network and recurrent neural network. Before multimodal fusion, the two modalities are aligned through contrastive learning between images and text. In addition, to mitigate noise in the dataset and improve the model's generalization ability, a mutual learning network based on knowledge distillation lets the sub-networks learn from each other's predictions (both losses are sketched after the abstract). Experiments on the VQA-v2 dataset show that the pre-trained model reaches an accuracy of 72.23%.
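To make contribution (1) concrete, here is a minimal sketch of mapping normalized bounding-box coordinates of detected objects to a higher-dimensional sinusoidal position encoding. The function name, the per-coordinate dimension, and the Transformer-style frequency ladder are illustrative assumptions rather than the thesis's exact design.

```python
import math
import torch

def box_position_encoding(boxes: torch.Tensor, dim_per_coord: int = 64) -> torch.Tensor:
    """Map normalized box coordinates (x1, y1, x2, y2) in [0, 1] to a
    higher-dimensional sinusoidal encoding, one slice per coordinate.
    (Hypothetical helper; the thesis compares several encoding schemes.)"""
    half = dim_per_coord // 2
    # Geometric frequency ladder, as in Transformer position encodings.
    freqs = torch.exp(torch.arange(half, dtype=torch.float32)
                      * (-math.log(10000.0) / half))   # (half,)
    angles = boxes.unsqueeze(-1) * freqs               # (objects, 4, half)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=1)                    # (objects, 4 * dim_per_coord)

# Example: three detected objects with normalized corner coordinates.
boxes = torch.tensor([[0.1, 0.2, 0.5, 0.6],
                      [0.4, 0.4, 0.9, 0.8],
                      [0.0, 0.7, 0.3, 1.0]])
pos = box_position_encoding(boxes)  # shape (3, 256)
```

The resulting encoding could be added to or concatenated with the detector's region features before the intra-modal and inter-modal attention layers.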
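For contribution (2), the attention-diffusion idea can be illustrated by treating the row-normalized attention matrix over the graph nodes (image objects and question words) as an adjacency matrix and raising it to a higher power, so that two nodes also exchange information through shared neighbours. The dot-product scoring and hop count below are assumptions for illustration; AMAN's neighbour selection and layer stacking are more involved.

```python
import torch

def multi_hop_attention(features: torch.Tensor, hops: int = 2) -> torch.Tensor:
    """Aggregate multi-hop neighbour information by raising the attention
    (adjacency) matrix to a higher power before message passing.
    features: (num_nodes, dim) -- object and word nodes in one graph."""
    scores = features @ features.T / features.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)       # first-order attention weights
    # A^k accumulates paths of length k, diffusing attention to nodes that
    # are only indirectly related to the current pair (higher-order adjacency).
    diffusion = torch.matrix_power(attn, hops)
    return diffusion @ features                # updated node representations
```

Stacking several such layers, each recomputing attention on the updated node features, would give the hierarchical reasoning process described above.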
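Contribution (3) relies on two objectives that have standard forms in the literature: a symmetric InfoNCE loss for image-text contrastive alignment before fusion, and a symmetric KL term for mutual learning between sub-networks via knowledge distillation. The sketch below shows commonly used versions of both; the temperature values and function names are assumptions, not the thesis's exact objectives.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs sit on the diagonal of the
    similarity matrix and are pulled together before multimodal fusion."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature                     # (batch, batch)
    targets = torch.arange(img.shape[0], device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

def mutual_learning_loss(logits_a, logits_b, tau=1.0):
    """Mutual-learning term: each sub-network's softened predictions supervise
    the other (symmetric KL), which damps the effect of noisy labels."""
    log_p_a = F.log_softmax(logits_a / tau, dim=-1)
    log_p_b = F.log_softmax(logits_b / tau, dim=-1)
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return tau * tau * (kl_ab + kl_ba)
```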
Keywords/Search Tags:Visual Question Answering, multimodal position alignment, multi-hop attention network, pre-training