
Research On Multimodal Information Enhancement Method Towards Visual Question Answering

Posted on: 2024-06-22    Degree: Master    Type: Thesis
Country: China    Candidate: Y L Jiang    Full Text: PDF
GTID: 2568307136487974    Subject: Signal and Information Processing

Abstract/Summary:
Visual Question Answering (VQA) requires computers to provide answers based on images and questions, and it is an essential branch of research on cross-modal intelligence. As research in this field deepens, VQA is evolving towards greater diversity and complexity, giving rise to two novel task scenarios: multi-image question-answering tasks and question-answering tasks involving external knowledge. In multi-image question answering, the model must accurately understand the semantics of both the image content and the text question, and establish a reasonable connection between the two so that it remains robust when addressing different questions about the same scene. Question answering involving external knowledge requires the model to associate the semantics of the image and question with external information to obtain accurate answers.

Addressing these new requirements demands a substantial amount of annotated multimodal information. On the one hand, the first task needs many images similar to the corresponding question-image pairs in order to reinforce the model's understanding of question-related image content. On the other hand, the second task calls for exploiting external knowledge bases to enrich the model's repository of image-related knowledge. However, the current data acquisition method, which relies primarily on manual annotation, struggles to meet these demands, resulting in poor robustness in multi-image VQA and low accuracy in VQA tasks involving external knowledge. Hence, this thesis investigates how to enhance multimodal information without relying on manual annotation, and how to complement visual and semantic features from multiple perspectives in both text and image content, aiming to improve the performance of VQA models in these scenarios. The main contributions of this work are as follows:

(1) To address the poor robustness of VQA involving multiple images, a sample-recombination VQA method is proposed. To alleviate the superficial-correlation problem caused by limited data, the method recombines images and texts from the same scene in the original samples, generating a large number of unlabeled new samples. An entropy-minimization loss is then used to produce reliable pseudo-labels for these unlabeled data during training. In addition, a consistency loss between the model trained with the new samples and the model trained without them is computed to prevent overfitting (see the sketch after the abstract). Experimental results on the NLVR2, NLVR1, and SNLI-VE datasets show that this method effectively improves the model's robustness to different questions in the same scene and enhances its reasoning ability.

(2) To address the low answer accuracy of VQA involving external knowledge, a knowledge-enhanced VQA method is proposed. This method uses a text-image retrieval model to search for relevant textual knowledge in the Wikidata knowledge base. The retrieved text is then refined by a second filtering step based on candidate answers generated by GPT-3, reducing the interference of irrelevant information on the results. In addition, to bring the reasoning and comprehension capabilities of large language models (LLMs) to VQA, the method designs learnable prompt tokens that guide the model to mine the connection between the external knowledge and the question, thereby inferring the answer from its understanding of that knowledge (a sketch of this pipeline also follows below). Experimental results on the OK-VQA dataset show that this method effectively improves the model's accuracy on questions involving external knowledge.
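The following is a minimal sketch of how the entropy-minimization and consistency losses in contribution (1) could be combined with a supervised loss. The model interface, batch keys, reference model, and loss weights (lambda_ent, lambda_cons) are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits):
    # Encourage confident (low-entropy) predictions on unlabeled recombined
    # samples, which serve as soft pseudo-labels during training.
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(probs * log_probs).sum(dim=-1).mean()

def consistency_loss(logits_student, logits_teacher):
    # Keep the model trained with recombined samples close to a reference
    # model trained without them (KL divergence between predictions).
    p_teacher = F.softmax(logits_teacher.detach(), dim=-1)
    log_p_student = F.log_softmax(logits_student, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def training_step(model, ref_model, labeled_batch, recombined_batch,
                  lambda_ent=0.1, lambda_cons=0.1):
    # Supervised loss on the original annotated image-question pairs
    # (hypothetical model(image, question) -> logits interface).
    logits = model(labeled_batch["image"], labeled_batch["question"])
    loss_sup = F.cross_entropy(logits, labeled_batch["label"])

    # Unsupervised losses on recombined (image, question) pairs from the same scene.
    logits_new = model(recombined_batch["image"], recombined_batch["question"])
    with torch.no_grad():
        logits_ref = ref_model(recombined_batch["image"], recombined_batch["question"])

    return (loss_sup
            + lambda_ent * entropy_minimization_loss(logits_new)
            + lambda_cons * consistency_loss(logits_new, logits_ref))
```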
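The next sketch illustrates the knowledge-enhanced pipeline of contribution (2): retrieved Wikidata passages are filtered against GPT-3 candidate answers, and learnable prompt embeddings are prepended to a frozen language model's input. The retriever.search signature, the HuggingFace-style inputs_embeds call, and all hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LearnablePromptVQA(nn.Module):
    # Prompt tuning: a small set of trainable prompt embeddings is prepended
    # to the frozen LLM's input to link knowledge, candidates, and question.
    def __init__(self, language_model, embed_dim, num_prompt_tokens=16):
        super().__init__()
        self.lm = language_model  # pretrained LLM, kept frozen
        for p in self.lm.parameters():
            p.requires_grad = False
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, embed_dim) embeddings of the text
        # "[retrieved knowledge] [candidate answers] [question]".
        batch = token_embeddings.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs = torch.cat([prompts, token_embeddings], dim=1)
        # Assumes the wrapped LLM accepts precomputed input embeddings.
        return self.lm(inputs_embeds=inputs)

def build_knowledge_context(question, image, retriever, candidate_answers, top_k=5):
    # Retrieve Wikidata passages with a (hypothetical) text-image retriever,
    # then keep only passages mentioning a GPT-3 candidate answer (second filter).
    passages = retriever.search(image=image, query=question, top_k=top_k)
    filtered = [p for p in passages
                if any(ans.lower() in p.lower() for ans in candidate_answers)]
    return " ".join(filtered) if filtered else " ".join(passages)
```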
Keywords/Search Tags: Visual question answering, Data augmentation, Knowledge enhancement, Pre-trained model