
Research Of Visual Question Answering Method Based On Deep Learning

Posted on: 2024-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Liu
Full Text: PDF
GTID: 2568307100462094
Subject: Computer technology
Abstract/Summary:
Visual Question Answering (VQA) is an interdisciplinary research problem spanning computer vision, natural language processing, and related fields, and is one of the most compelling cross-domain applications. VQA tasks involve deep understanding of images and text, knowledge representation, and intelligent reasoning, aiming to generate a natural language answer for a given image and a related natural language question.

Current VQA models still face several problems. On the one hand, most existing models represent visual features with either region-based top-down features or grid-based global features, attending only to local objects or only to the global image; the former loses context, while the latter underuses local detail. On the other hand, text feature extraction in most existing models is coarse-grained, leaving the deeper semantics of the question underexploited. Meanwhile, traditional text-image fusion typically concatenates feature vectors hierarchically, which fails to resolve the information redundancy and conflict that can arise between modalities.

To address these problems, this research proposes a multi-level visual feature enhancement method for VQA and a multi-granularity text representation with Transformer-based fusion for VQA. The specific research contents are as follows:

(1) To address the underutilization of visual features, this research uses a multi-level visual feature enhancement method. The method combines global (pixel-level) and local (object-level) features to learn multi-level visual representations across multiple spaces, and consists of two main modules: a graph-attention-based separated visual feature representation module and a graph-attention-based joint visual feature representation module. The separated module uses two independent graph attention networks to learn the global and local visual features of an image, respectively. The joint module uses a graph attention network to capture the semantic relationships between global and local features, so that both kinds of information are acquired. A gated fusion mechanism after the two modules combines shallow visual detail features with deep semantic features to obtain multi-level visual representations. Experiments verify that the multi-level visual feature enhancement method improves VQA accuracy over traditional methods.

(2) To address the loss of text features and the redundancy introduced by fusing different features, this research adopts a multi-granularity text representation and a Transformer-based fusion method. Against text feature loss, the method uses a multi-level dilated convolutional network to learn question feature representations over multiple semantic units; the output of each convolution layer serves as a question feature at a different granularity, preserving more of the text's informative features. Against redundant and conflicting fusion information, this research uses a Transformer-based multimodal fusion mechanism, which exploits the internal correlations between modalities and dynamically computes the weights of each modality's features, learning the information of the different modalities simultaneously while keeping contextual information intact. Through the Transformer network, the method effectively captures long-range dependencies between modalities and thus better integrates multimodal information. Experiments demonstrate that the proposed method achieves superior performance on the VQA 2.0 dataset.

In summary, this research proposes a multi-level visual feature enhancement VQA method and a VQA method based on multi-granularity text representation and Transformer fusion, which address the problems of visual feature representation, text feature representation, and feature fusion in VQA. Both proposed methods are experimentally verified to perform well.
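The abstract does not give the thesis's exact formulation, but the core of the visual modules — a graph attention layer over region or grid features, followed by a gated fusion of shallow and deep representations — can be sketched as follows. This is a minimal single-head NumPy illustration in the style of standard graph attention networks; all shapes, the LeakyReLU slope, and the sigmoid gate form are assumptions, not the thesis's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(H, A, W, a):
    """One single-head graph-attention layer (hypothetical sketch).

    H: (N, F) node features (image regions or grid cells),
    A: (N, N) adjacency mask (1 = attend, 0 = ignore),
    W: (F, Fp) projection, a: (2*Fp,) attention vector.
    """
    Z = H @ W                                   # (N, Fp) projected features
    N = Z.shape[0]
    logits = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            # e_ij = LeakyReLU(a^T [z_i || z_j])
            e = a @ np.concatenate([Z[i], Z[j]])
            logits[i, j] = e if e > 0 else 0.2 * e
    logits = np.where(A > 0, logits, -1e9)      # mask non-edges
    alpha = softmax(logits, axis=1)             # attention coefficients
    return alpha @ Z                            # (N, Fp) updated node features

def gated_fusion(v_shallow, v_deep, Wg, bg):
    """Gate g in (0,1) blends shallow detail with deep semantic features."""
    g = 1.0 / (1.0 + np.exp(-(Wg @ np.concatenate([v_shallow, v_deep]) + bg)))
    return g * v_shallow + (1 - g) * v_deep
```

In the separated module, two such layers would run independently on the global-feature graph and the object-feature graph; the joint module would run one layer on a graph connecting both node sets.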
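The multi-granularity question representation can likewise be illustrated with a small NumPy sketch: stacked 1-D dilated convolutions over the token embeddings, where each layer's output is kept as one granularity of question feature. Kernel size, dilation schedule, and the ReLU nonlinearity are illustrative assumptions, not details from the thesis.

```python
import numpy as np

def dilated_conv1d(X, W, dilation):
    """'Same'-padded 1-D dilated convolution over a token sequence.

    X: (T, C_in) token embeddings, W: (K, C_in, C_out) kernel, K odd.
    Returns (T, C_out); the receptive field grows with the dilation rate.
    """
    K, C_in, C_out = W.shape
    pad = dilation * (K - 1) // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    T = X.shape[0]
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):
            out[t] += Xp[t + k * dilation] @ W[k]
    return np.maximum(out, 0)  # ReLU

def multi_granularity(X, kernels, dilations=(1, 2, 4)):
    """Stack dilated conv layers; each layer's output is one granularity
    (roughly word-, phrase-, and clause-level question features)."""
    feats, H = [], X
    for W, d in zip(kernels, dilations):
        H = dilated_conv1d(H, W, d)
        feats.append(H)
    return feats
```

Keeping every layer's output, rather than only the last, is what preserves features at multiple granularities for the downstream fusion stage.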
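Finally, the Transformer-based fusion rests on cross-modal scaled dot-product attention: queries drawn from one modality attend over keys and values from the other, so the weight of each modal feature is computed dynamically. The sketch below shows a single unbatched head in NumPy; projection sizes and the query/key role assignment are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(Q_feats, K_feats, Wq, Wk, Wv):
    """Single-head cross-modal attention (hypothetical sketch).

    Q_feats: (T, F) features of one modality (e.g. question tokens),
    K_feats: (N, F) features of the other (e.g. image regions).
    Returns (T, Fp): query-modality features enriched with the other
    modality's context, with attention spanning all positions, so
    long-range cross-modal dependencies are captured in one step.
    """
    Q, K, V = Q_feats @ Wq, K_feats @ Wk, K_feats @ Wv
    d = Q.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (T, N) weights
    return alpha @ V
```

A full Transformer fusion block would add multiple heads, residual connections, and feed-forward layers, and typically attend in both directions (text-to-image and image-to-text).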
Keywords/Search Tags:Visual Question Answering, Transformer, Graph Neural Networks, Multi-modal Feature Fusion