Research On Visual Question Answering Algorithm Based On Feature Fusion Of Attention Mechanism

Posted on: 2022-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: X D Meng
Full Text: PDF
GTID: 2518306785460014
Subject: Computer Software and Application of Computer
Abstract/Summary:
In recent years, visual question answering, a complex learning task at the intersection of computer vision and natural language processing, has become a research hotspot in cross-media expression and interaction, owing to its substantial body of basic research and its wide range of applications. Visual question answering involves visual understanding and knowledge reasoning: the computer learns to understand the image content and the related question text, and then answers the question automatically. Visual and text feature extraction and multimodal feature fusion are particularly critical to this process. However, because text features are insufficiently extracted and the features of different modalities do not fuse well, current visual question answering models struggle to represent multimodal information and to understand the deep semantics of images, which leads to poor performance on visual question answering tasks. This thesis therefore introduces the Transformer model into the visual question answering task and, by making full use of visual content and positional relationship information, learns a more effective multimodal feature fusion and semantic representation method in order to improve the performance of the visual question answering model. The main work of this thesis includes the following aspects:

(1) A multimodal feature fusion algorithm based on Transformer is proposed. A multilevel feature fusion model based on Transformer is designed. First, Faster R-CNN extracts visual features, into which position encodings are embedded, while GloVe extracts features of the question text; both serve as input to the Transformer for first-level feature fusion, which yields "word-visual object" fusion features. To further strengthen the fusion of image and text features, a second-level fusion combines the visual features with the first-level multimodal fusion features to obtain the final "phrase-visual object" fusion features. The effectiveness of the proposed feature fusion algorithm is verified on the Flickr30k Entities dataset for the phrase localization task. Experimental results show that the Transformer model performs well in multimodal feature fusion.

(2) A visual question answering algorithm based on modulation detection is proposed. For the visual question answering task, this thesis again uses the Faster R-CNN model to extract visual features, supplemented with location information to improve the accuracy of visual target localization. For text feature extraction, in view of the shortcomings of the GloVe model, the BERT pre-trained model is used to extract features of the question text. A second fusion in the Transformer-based attention model yields a finer-grained correspondence between the question text and the visual information, which improves the accuracy of visual question answering. This is also where modulation detection plays its role: unlike conventional detection, which extracts all image and text features, modulation detection extracts the visual features that correspond to the text. On this basis, to further improve accuracy, question type features are extracted and the question type is predicted; the question type is then combined with the candidate answers to complete the visual question answering task. In comparative experiments on the GQA, VQA2.0, and Flickr30k Entities datasets, the algorithm achieves high question answering accuracy, which verifies its effectiveness on the visual question answering task.

(3) A visual question answering system is designed and developed. Building on the preceding research on visual question answering algorithms, the proposed model is applied in a practical visual question answering application that visualizes the question answering results. The system is expected to find application in fields such as intelligent voice assistants, early childhood education, medical guidance, and graphical verification codes. The system is developed with PyTorch, Django, and MySQL. Its main functions include image and text processing, text retrieval, text description, and visual question answering, and it supports user feedback on results, which gives it good interactivity and practicality.
Keywords/Search Tags: Visual question answering, Attention mechanism, Transformer, Multi-modality, Feature fusion
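
The first-level "word-visual object" fusion described in item (1) of the abstract can be illustrated with a minimal sketch of scaled dot-product cross-attention, the core operation of a Transformer. This is not the thesis implementation: the function names, the toy two-dimensional features, and the box coordinates are all hypothetical, the learned projection matrices of a real Transformer are omitted, and hand-written vectors stand in for Faster R-CNN and GloVe outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_words_with_regions(word_feats, region_feats, boxes):
    """Each word attends over all image regions (hypothetical helper).

    Keys are the region features; values are the region features with
    normalized box coordinates appended, a simplified stand-in for the
    position encoding mentioned in the abstract.
    """
    dim = len(region_feats[0])
    values = [feat + list(box) for feat, box in zip(region_feats, boxes)]
    fused, weights = [], []
    for word in word_feats:
        # Scaled dot product between the word (query) and every region (key).
        scores = [
            sum(w * r for w, r in zip(word, region)) / math.sqrt(dim)
            for region in region_feats
        ]
        attn = softmax(scores)
        # Attention-weighted sum of the region values.
        attended = [
            sum(a * value[i] for a, value in zip(attn, values))
            for i in range(len(values[0]))
        ]
        weights.append(attn)
        # Concatenate the word with its attended visual context.
        fused.append(word + attended)
    return fused, weights


# Toy inputs (assumed): two words and two detected regions.
word_feats = [[1.0, 0.0], [0.0, 1.0]]      # e.g. "dog", "ball"
region_feats = [[4.0, 0.0], [0.0, 4.0]]    # region 0 is dog-like, region 1 ball-like
boxes = [(0.1, 0.2, 0.5, 0.9), (0.6, 0.7, 0.8, 0.9)]  # normalized x1, y1, x2, y2

fused, weights = fuse_words_with_regions(word_feats, region_feats, boxes)
for word, attn in zip(["dog", "ball"], weights):
    print(word, [round(a, 3) for a in attn])
```

In this toy setting each word places most of its attention on the region whose feature points in the same direction, and the fused vector carries that region's box coordinates along with the word. A second fusion level, as the abstract describes, would apply a further attention step between the visual features and these first-level fused features; a real model would also need learned projections so that the two inputs share a dimension.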