
Research On Multimodal Information Enhancement Method Towards Visual Question Answering

Posted on: 2024-06-22    Degree: Master    Type: Thesis
Country: China    Candidate: Y L Jiang    Full Text: PDF
GTID: 2568307136487974    Subject: Signal and Information Processing

Abstract/Summary:
Visual Question Answering (VQA) requires computers to provide answers based on images and questions, and it is an essential branch of research on cross-modal intelligence. As research in this field deepens, VQA is evolving towards greater diversity and complexity, giving rise to two novel task scenarios: multi-image question-answering tasks and question-answering tasks involving external knowledge. In multi-image question answering, the model must accurately understand the semantics of both the image content and the text question, and establish a reasonable connection between the two so that it remains robust when addressing different questions about the same scene. Question answering involving external knowledge requires the model to associate the semantics of the image and question with external information to obtain accurate answers.

Addressing these new requirements demands a substantial amount of annotated multimodal information. On the one hand, the first task needs many images similar to the corresponding question-image pairs in order to reinforce the model's understanding of question-related image content. On the other hand, the second task calls for exploiting external knowledge bases to enrich the model's repository of image-related knowledge. However, the current data acquisition method, which relies primarily on manual annotation, struggles to meet these demands, resulting in poor robustness in multi-image VQA and low accuracy in VQA tasks involving external knowledge. Hence, this thesis investigates how to enhance multimodal information without relying on manual annotation, and how to complement visual and semantic features from multiple perspectives in both text and image content, aiming to improve the performance of VQA models in these scenarios. The main contributions of this work are as follows:

(1) To address the poor robustness of VQA involving multiple images, a sample-recombination VQA method is proposed. To alleviate the superficial-correlation problem caused by limited data, the method recombines images and texts from the same scene in the original samples, generating a large number of unlabeled new samples. An entropy-minimization loss is then used to produce reliable pseudo-labels for these unlabeled data during training. In addition, a consistency loss between the model trained with the new samples and the model trained without them is computed to prevent overfitting (see the sketch after the abstract). Experimental results on the NLVR2, NLVR1, and SNLI-VE datasets show that this method effectively improves the model's robustness to different questions in the same scene and enhances its reasoning ability.

(2) To address the low answer accuracy of VQA involving external knowledge, a knowledge-enhanced VQA method is proposed. This method uses a text-image retrieval model to search for relevant textual knowledge in the Wikidata knowledge base. The retrieved text is then refined by a second filtering step based on candidate answers generated by GPT-3, reducing the interference of irrelevant information on the results. In addition, to bring the reasoning and comprehension capabilities of large language models (LLMs) to VQA, the method designs learnable prompt tokens that guide the model to mine the connection between the external knowledge and the question, thereby inferring the answer from its understanding of that knowledge (a sketch of this pipeline also follows below). Experimental results on the OK-VQA dataset show that this method effectively improves the model's accuracy on questions involving external knowledge.
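The following is a minimal sketch of how the entropy-minimization and consistency losses in contribution (1) could be combined with a supervised loss. The model interface, batch keys, reference model, and loss weights (lambda_ent, lambda_cons) are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits):
    # Encourage confident (low-entropy) predictions on unlabeled recombined
    # samples, which serve as soft pseudo-labels during training.
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(probs * log_probs).sum(dim=-1).mean()

def consistency_loss(logits_student, logits_teacher):
    # Keep the model trained with recombined samples close to a reference
    # model trained without them (KL divergence between predictions).
    p_teacher = F.softmax(logits_teacher.detach(), dim=-1)
    log_p_student = F.log_softmax(logits_student, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def training_step(model, ref_model, labeled_batch, recombined_batch,
                  lambda_ent=0.1, lambda_cons=0.1):
    # Supervised loss on the original annotated image-question pairs
    # (hypothetical model(image, question) -> logits interface).
    logits = model(labeled_batch["image"], labeled_batch["question"])
    loss_sup = F.cross_entropy(logits, labeled_batch["label"])

    # Unsupervised losses on recombined (image, question) pairs from the same scene.
    logits_new = model(recombined_batch["image"], recombined_batch["question"])
    with torch.no_grad():
        logits_ref = ref_model(recombined_batch["image"], recombined_batch["question"])

    return (loss_sup
            + lambda_ent * entropy_minimization_loss(logits_new)
            + lambda_cons * consistency_loss(logits_new, logits_ref))
```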
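The next sketch illustrates the knowledge-enhanced pipeline of contribution (2): retrieved Wikidata passages are filtered against GPT-3 candidate answers, and learnable prompt embeddings are prepended to a frozen language model's input. The retriever.search signature, the HuggingFace-style inputs_embeds call, and all hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LearnablePromptVQA(nn.Module):
    # Prompt tuning: a small set of trainable prompt embeddings is prepended
    # to the frozen LLM's input to link knowledge, candidates, and question.
    def __init__(self, language_model, embed_dim, num_prompt_tokens=16):
        super().__init__()
        self.lm = language_model  # pretrained LLM, kept frozen
        for p in self.lm.parameters():
            p.requires_grad = False
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, embed_dim) embeddings of the text
        # "[retrieved knowledge] [candidate answers] [question]".
        batch = token_embeddings.size(0)
        prompts = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs = torch.cat([prompts, token_embeddings], dim=1)
        # Assumes the wrapped LLM accepts precomputed input embeddings.
        return self.lm(inputs_embeds=inputs)

def build_knowledge_context(question, image, retriever, candidate_answers, top_k=5):
    # Retrieve Wikidata passages with a (hypothetical) text-image retriever,
    # then keep only passages mentioning a GPT-3 candidate answer (second filter).
    passages = retriever.search(image=image, query=question, top_k=top_k)
    filtered = [p for p in passages
                if any(ans.lower() in p.lower() for ans in candidate_answers)]
    return " ".join(filtered) if filtered else " ".join(passages)
```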
Keywords/Search Tags: Visual question answering, Data augmentation, Knowledge enhancement, Pre-trained model