
Enhanced Visual Feature For Visual Question Answering

Posted on: 2021-03-20 | Degree: Master | Type: Thesis
Country: China | Candidate: S J Qin | Full Text: PDF
GTID: 2428330605955630 | Subject: Computer application technology
Abstract/Summary:
Visual Question Answering (VQA) is a typical multi-modal task and an important research topic. It combines computer vision and natural language processing and requires images and text to be processed and fused simultaneously. The representation and fusion of multi-modal features therefore play a key role in the performance of a visual question answering model; the problem has attracted considerable attention, and many schemes have been proposed. A review of existing models shows that they still fall short in understanding sentence semantics and in focusing on the image regions relevant to the question, which limits their performance. This thesis proposes an enhanced visual feature to address these shortcomings and improves model performance by improving the image features. The research is as follows:

(1) A multi-modal fusion model based on a joint attention mechanism and enhanced visual features is proposed, which achieves a fine-grained representation of feature information. The enhanced visual features are obtained by combining spatial features and object features. In addition, a Bidirectional Long Short-Term Memory (BiLSTM) network is used to implement a self-attention mechanism over the question itself and to focus on important areas of the visual features based on the keywords in the question. Finally, a Multi-modal Factorized Bilinear Pooling model is used to fuse the image and text features. The effectiveness of the proposed model is verified on visual question answering tasks through extensive comparative experiments and analysis. Compared with existing baseline models and state-of-the-art models, the proposed model achieves the best performance on the GQA dataset, which also shows that enhanced visual features can effectively improve model performance.

(2) For the Modular Co-attention stacking model, we propose adding the position coordinates of each object to the image's object features as an enhanced visual feature. Compared with using the object features alone, the enhanced visual feature contains more fine-grained information, namely the absolute position of each object, which makes the model more accurate when focusing on the relevant regions of the image. The stacking model contains self-attention units and guided-attention units that implement joint attention learning over images and text. The improved model is evaluated on the VQA-v2 dataset for visual question answering. Extensive experiments and comparisons with related baseline models and state-of-the-art models show that the model with enhanced visual features achieves the best performance, again validating the effectiveness of the research in this thesis. Illustrative sketches of the techniques named above follow.
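To make contribution (1) concrete, the following is a minimal PyTorch sketch of one way to combine spatial (grid) features and object (region) features into an enhanced visual representation. The abstract states only that the two feature types are combined; the projection sizes and the concatenation along the region axis are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class EnhancedVisualFeature(nn.Module):
    """Sketch: fuse grid-level spatial features with region-level object
    features into one enhanced visual representation. Hidden size,
    projections, and concatenation strategy are assumptions."""

    def __init__(self, spatial_dim=2048, object_dim=2048, hidden_dim=1024):
        super().__init__()
        self.spatial_proj = nn.Linear(spatial_dim, hidden_dim)
        self.object_proj = nn.Linear(object_dim, hidden_dim)

    def forward(self, spatial_feats, object_feats):
        # spatial_feats: (batch, n_grid, spatial_dim), e.g. CNN grid features
        # object_feats:  (batch, n_obj, object_dim),  e.g. Faster R-CNN regions
        s = torch.relu(self.spatial_proj(spatial_feats))
        o = torch.relu(self.object_proj(object_feats))
        # Concatenate along the region axis so downstream attention can
        # attend over spatial and object locations jointly.
        return torch.cat([s, o], dim=1)  # (batch, n_grid + n_obj, hidden_dim)
```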
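The abstract names Multi-modal Factorized Bilinear Pooling (MFB) as the fusion module for contribution (1). Below is a sketch of the standard MFB formulation (low-rank bilinear pooling with sum pooling over a factor axis, then power and L2 normalization); the factor size and output dimension here are illustrative defaults, not the thesis's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Standard Multi-modal Factorized Bilinear Pooling: project both
    modalities to out_dim * factor, multiply elementwise, sum-pool over
    the factor axis, then apply power and L2 normalization."""

    def __init__(self, img_dim=1024, ques_dim=1024, out_dim=1000, factor=5):
        super().__init__()
        self.out_dim, self.factor = out_dim, factor
        self.img_proj = nn.Linear(img_dim, out_dim * factor)
        self.ques_proj = nn.Linear(ques_dim, out_dim * factor)

    def forward(self, v, q):
        # v: (batch, img_dim) attended image feature
        # q: (batch, ques_dim) attended question feature
        joint = self.img_proj(v) * self.ques_proj(q)               # (batch, out_dim * factor)
        joint = joint.view(-1, self.out_dim, self.factor).sum(2)   # sum pooling over factors
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)  # power normalization
        return F.normalize(joint, dim=1)                           # L2 normalization
```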
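For contribution (2), the abstract describes appending each object's position coordinates to its object feature. A minimal sketch under the common convention of normalized corner coordinates follows; the exact encoding (e.g., whether box width/height or center coordinates are also appended) is an assumption.

```python
import torch

def append_position_info(object_feats, boxes, image_size):
    """Append normalized bounding-box coordinates to each object feature.
    object_feats: (n_obj, d) region features
    boxes:        (n_obj, 4) as (x1, y1, x2, y2) in pixels -- assumed convention
    image_size:   (width, height) of the source image
    """
    w, h = image_size
    scale = torch.tensor([w, h, w, h], dtype=boxes.dtype, device=boxes.device)
    norm_boxes = boxes / scale                             # coordinates in [0, 1]
    return torch.cat([object_feats, norm_boxes], dim=-1)   # (n_obj, d + 4)
```

Normalizing by image size keeps the absolute-position signal comparable across images of different resolutions, which is presumably what lets the attention units exploit it consistently.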
Keywords/Search Tags: Visual Question Answering, Multi-modal Information Fusion, Enhanced Visual Feature, Object Location Information