
Research On Multimodal Interaction Model And Optimization Method For Visual Question Answering

Posted on: 2023-10-28
Degree: Master
Type: Thesis
Country: China
Candidate: H Yan
Full Text: PDF
GTID: 2568306797973329
Subject: Computer application technology
Abstract/Summary:
In recent years, the popularization of the mobile Internet and significant improvements in computer performance have enabled the effective collection, processing, and storage of massive multimedia data and have driven the rapid development of artificial intelligence. With the continuous growth of the computing power of computers and servers, deep learning has attracted great attention from researchers. Deep learning has achieved innovative developments and breakthroughs in natural language processing and computer vision, further promoting research on multimodal tasks. Among these tasks, visual question answering (VQA) has attracted wide attention and has gradually developed into one of the current hot research directions.

The visual question answering task involves both computer vision and natural language processing. It requires the integration of techniques such as visual analysis, language understanding, multimodal information fusion, and reasoning, which makes it complex and challenging. Existing methods adopt various complex attention mechanisms and multimodal feature fusion schemes to capture the high-level semantic interaction between the two modalities. However, because of linguistic correlations in the training data, and because attention mechanisms that relate individual words to visual regions ignore contextual information when computing cross-modal dependencies, it is difficult for VQA models to infer answers effectively from the given images. This thesis explores the information interaction between image and language in the visual question answering task and designs methods to enhance the interaction between modalities and to improve the reasoning ability of VQA models.

(1) A context-aware multimodal interaction network is proposed. The attention mechanism is a central idea in VQA methods and can effectively capture information interactions within and between modalities. However, existing attention mechanisms ignore contextual information when computing the dependencies between modalities. To address this problem, this thesis proposes a context-aware multimodal interaction network in which global context information summarizes the semantic information of each modality from a global perspective; this strengthens the modeling of intra- and inter-modality dependencies and improves the reasoning ability of visual question answering (an illustrative sketch follows the abstract).

(2) Analysis of the training process on the existing dataset reveals that language priors still influence the model, so a visual question answering optimization method based on self-contrastive learning is proposed, which further optimizes the method in (1). The attention mechanism can adaptively select important features and thereby effectively enhance the interaction between the visual and language modalities. However, under the influence of language priors, a VQA model may ignore the image information during training and generate answers directly from the statistical priors of the training set. We therefore propose a novel self-contrastive learning method that addresses this problem without introducing auxiliary tasks. Concretely, when the question attends to question-relevant regions and to question-irrelevant regions, different answer spaces are generated, and contrasting them prevents the model from being driven by surface language priors; the question is thus forced to rely on the relevant image regions to predict the correct answer (a sketch of this objective also follows the abstract).

A series of experiments and analyses on the large-scale benchmark dataset VQA v2.0 verifies the effectiveness of the model. The dependence of the answers on the image, and the effectiveness of the optimization method, which introduces no additional annotation or auxiliary tasks, are further verified by comparative experiments on VQA-CP, a reorganized version of the VQA v2.0 dataset.
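As an illustration of contribution (1) only, and not the thesis's actual architecture, the following is a minimal PyTorch-style sketch of context-aware inter-modal attention: a global context vector, obtained by mean-pooling each modality's features, conditions the attention scores between question words and image regions. All module and variable names (ContextAwareAttention, ctx_proj, and so on) are hypothetical.

    # Hypothetical sketch: a global context vector conditions word-region attention.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextAwareAttention(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)        # projects word features
            self.k_proj = nn.Linear(dim, dim)        # projects region features
            self.v_proj = nn.Linear(dim, dim)
            self.ctx_proj = nn.Linear(2 * dim, dim)  # fuses both global contexts

        def forward(self, q_feats, v_feats):
            # q_feats: (B, T, D) word features; v_feats: (B, R, D) region features.
            # Global context: mean-pool each modality, then fuse into one vector.
            ctx = self.ctx_proj(torch.cat([q_feats.mean(1), v_feats.mean(1)], dim=-1))
            # Condition the queries on the global context before scoring, so each
            # word-region dependency is computed with global semantics in view.
            q = self.q_proj(q_feats) + ctx.unsqueeze(1)           # (B, T, D)
            k, v = self.k_proj(v_feats), self.v_proj(v_feats)     # (B, R, D)
            attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
            return attn @ v  # context-conditioned attended visual features (B, T, D)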
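In the same spirit, here is a sketch of the self-contrastive objective of contribution (2), assuming (as the abstract suggests, though this is not necessarily the thesis's exact formulation) that two answer distributions are produced, one from the question-relevant attention map and one from its complement over question-irrelevant regions, and contrasted with a margin so the ground-truth answer must be better supported by the relevant regions. The function name, the margin term, and the argument conventions are assumptions.

    # Hypothetical self-contrastive VQA loss; `classifier` maps fused
    # multimodal features (B, D) to answer logits (B, A).
    import torch.nn.functional as F

    def self_contrastive_loss(classifier, fused_rel, fused_irr, answer, margin=1.0):
        # fused_rel / fused_irr: features built with the original attention map
        # vs. its complement; answer: (B,) ground-truth answer indices.
        logits_rel = classifier(fused_rel)  # answers grounded in relevant regions
        logits_irr = classifier(fused_irr)  # answers from irrelevant regions
        # Standard VQA classification loss on the relevant branch.
        vqa_loss = F.cross_entropy(logits_rel, answer)
        # Contrast: the ground-truth log-probability under the relevant branch
        # should exceed that under the irrelevant branch by at least `margin`,
        # so answers driven purely by language priors (equally likely under
        # both branches) are penalized.
        gt_rel = logits_rel.log_softmax(-1).gather(1, answer[:, None]).squeeze(1)
        gt_irr = logits_irr.log_softmax(-1).gather(1, answer[:, None]).squeeze(1)
        contrast = F.relu(margin - (gt_rel - gt_irr)).mean()
        return vqa_loss + contrast

Because no auxiliary task or extra annotation is involved, the contrast signal comes entirely from the model's own two attention branches, consistent with the abstract's claim of optimization without additional supervision.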
Keywords/Search Tags:deep learning, computer vision, natural language processing, visual question answering, attention mechanism