
Research On Visual Question Answering Algorithm Based On Deep Learning

Posted on: 2024-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: R C Lan
Full Text: PDF
GTID: 2568307061490274
Subject: New Generation Electronic Information Technology (including quantum technology, etc.) (Professional Degree)

Abstract/Summary:
Visual question answering (VQA) is an emerging multimodal task that combines two different modalities, image and text, and requires a model that can correctly answer a series of natural language questions about a given picture. Compared with traditional text-based question answering, visual question answering demands simultaneous image and text understanding together with cross-modal reasoning; it is not only a more advanced form of existing question-answering systems but also a necessary step toward the further development of artificial intelligence. The task also matches the multimodal perception humans face in the real world, and it has broad applications in scenarios such as assistance for the blind, early childhood education, and intelligent customer service.

Answering questions correctly in the VQA domain requires two things: accurate multi-step reasoning and complete prior knowledge. Multi-step reasoning usually relies on the structure of the deep learning model, while prior knowledge must be extracted from the Internet with the help of additional techniques. This thesis addresses these two issues and improves on existing methods to obtain a more accurate and interpretable visual question answering model. The specific contributions are as follows.

1. A shrinking Transformer framework for accurate multimodal alignment (ST-VQA) is proposed. Although Transformer-based models have been very successful in the VQA domain, the way they align visual and textual features is simple and coarse. This shortcoming has been further amplified in recent years by the popularity of vision-language pre-training, leaving effective architectures for multimodal alignment underdeveloped. We therefore propose the Shrinking Transformer for Visual Question Answering (ST-VQA) framework, which aims at more accurate multimodal alignment than the standard Transformer. First, ST-VQA uses region features as the visual representation of an image. Second, between Transformer layers, it reduces the number of visual regions through feature fusion and keeps the fused regions distinct through a contrastive loss. Finally, the visual and textual features are fused for answer prediction. Extensive experiments show that, without additional pre-training, the proposed method outperforms the standard Transformer and some state-of-the-art methods on VQA-v2 and other common datasets.
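As a rough illustration of the shrinking step, the sketch below (PyTorch) fuses the visual region tokens into a smaller set between encoder layers and applies a contrastive loss that keeps the fused regions distinct. The soft-assignment fusion rule, the InfoNCE-style loss, and the names RegionShrink and region_contrast_loss are assumptions for illustration only; the abstract does not specify these details.

    # Hypothetical sketch of the ST-VQA region-shrinking step.
    # The fusion mechanism (soft cluster assignment) and the contrastive
    # loss form (InfoNCE over fused regions) are assumptions, not the
    # thesis's confirmed design.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RegionShrink(nn.Module):
        """Fuse n visual region tokens into n_out < n tokens between layers."""
        def __init__(self, dim: int, n_out: int):
            super().__init__()
            # Soft assignment of each input region to one of n_out fused slots.
            self.assign = nn.Linear(dim, n_out)

        def forward(self, regions: torch.Tensor) -> torch.Tensor:
            # regions: (batch, n, dim)
            weights = self.assign(regions).softmax(dim=1)  # (batch, n, n_out)
            # Each fused slot is a weighted combination of the input regions.
            return torch.einsum('bnd,bnm->bmd', regions, weights)

    def region_contrast_loss(fused: torch.Tensor, temperature: float = 0.1):
        """Encourage fused regions within one image to stay mutually distinct."""
        z = F.normalize(fused, dim=-1)                      # (batch, m, dim)
        sim = torch.einsum('bmd,bnd->bmn', z, z) / temperature
        m = z.size(1)
        labels = torch.arange(m, device=z.device).expand(z.size(0), m)
        # Each region should be most similar to itself; high similarity to
        # other fused regions is penalized.
        return F.cross_entropy(sim.reshape(-1, m), labels.reshape(-1))

In the ST-VQA description, such a module would sit between successive Transformer layers, so the visual sequence shrinks progressively (e.g., 36 to 18 to 9 regions) before being fused with the text features for answer prediction.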
2. A knowledge-graph-enhanced Transformer (KGET) for the VQA domain is proposed. By integrating strong off-the-shelf models in a pipeline, current knowledge-enhanced VQA models have achieved good results, but this pipeline approach has two important drawbacks. First, the external models it relies on are trained for general-purpose scenarios and lack fine-tuned adaptation to the specific downstream task. Second, such methods do not examine how knowledge is introduced into and strengthens the VQA model, and they lack the necessary interpretability. To obtain a more effective and interpretable knowledge-enhanced VQA model, we propose a simple Knowledge Graph Embedding Transformer (KGET) framework. Specifically, KGET contains a text branch, a visual branch, and a knowledge branch, with cross-modal guidance between the branches provided by guided attention layers. Further, to introduce knowledge effectively, we train an embedding for every entity in the knowledge base using knowledge graph embedding techniques, which represents the knowledge as features. Finally, the knowledge-guided visual and textual features are fused to classify and decide the answer. Extensive comparison experiments on the OK-VQA dataset demonstrate the effectiveness of KGET.
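The three-branch layout with guided attention can be sketched roughly as follows (PyTorch). The use of nn.MultiheadAttention as the guided attention layer, the particular guidance directions, and the mean-pool-plus-concatenation fusion head are all assumptions; GuidedAttention and KGETBlock are hypothetical names, and the knowledge branch is assumed to receive pre-trained knowledge-graph entity embeddings already projected to the shared width.

    # Hypothetical sketch of KGET's three branches and guided attention.
    # Branch widths, guidance directions, and the fusion head are assumptions.
    import torch
    import torch.nn as nn

    class GuidedAttention(nn.Module):
        """One branch attends over another (query = guided, key/value = guide)."""
        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
            out, _ = self.attn(query=x, key=guide, value=guide)
            return self.norm(x + out)  # residual connection plus normalization

    class KGETBlock(nn.Module):
        def __init__(self, dim: int, n_answers: int):
            super().__init__()
            self.text_guides_visual = GuidedAttention(dim)
            self.know_guides_visual = GuidedAttention(dim)
            self.know_guides_text = GuidedAttention(dim)
            self.classifier = nn.Linear(3 * dim, n_answers)

        def forward(self, text, visual, knowledge):
            # knowledge: pre-trained KG entity embeddings for the retrieved
            # entities, shape (batch, n_entities, dim); retrieval is omitted.
            v = self.text_guides_visual(visual, text)
            v = self.know_guides_visual(v, knowledge)
            t = self.know_guides_text(text, knowledge)
            pooled = torch.cat([t.mean(1), v.mean(1), knowledge.mean(1)], dim=-1)
            return self.classifier(pooled)  # scores over a fixed answer vocabulary

In the full model such blocks would presumably be stacked, but a single block is enough to show how the knowledge features steer both the visual and textual streams before the fused representation is classified into an answer.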
Keywords/Search Tags: visual question answering, multimodal, knowledge graph, Transformer