
Research On Visual Information Enhancement For Visual Question Answering

Posted on: 2021-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: X M Li
Full Text: PDF
GTID: 2518306017474714
Subject: Computer technology
Abstract/Summary:
Visual Question Answering (VQA) is a deep learning task in which a model automatically answers questions posed about a given image or video. Because the questions can cover a wide range of topics and take many forms, VQA plays a pivotal role in both scientific research and industrial development. A VQA model must consider textual and visual information jointly: it predicts the answer by reasoning over features extracted from both the question text and the image. Although the underlying language and vision sub-tasks have advanced greatly in their respective fields over the past few years, VQA, as an interdisciplinary task, has improved only gradually, and effectively increasing the saliency of visual features remains an open problem. This thesis therefore focuses on how to obtain effective visual features and strengthen the role of visual information in answer prediction. The problem is explored from two perspectives.

First, because the data distribution of VQA datasets is uneven, existing methods rely heavily on linguistic priors to predict answers, which not only limits the answering ability of the model but also reduces the reliability of the predicted answers. To counter this imbalance between textual and visual information, we propose a joint learning strategy that increases the contribution of visual features to the prediction model. The strategy combines a triplet loss with dynamic margins and multimodal embedding learning: by jointly feeding the model "one question and two images", it is forced to attend to the distinct visual features of each image. This raises the proportion of visual information used in answer prediction and improves both the accuracy of the predicted answers and the reliability of the model.

Second, most existing VQA models extract question-relevant visual information through a visual attention mechanism, but current attention mechanisms lack direct supervisory signals, which limits the accuracy of the extracted visual features. To supervise attention learning in VQA, we propose a visual feature optimization method based on a self-supervised attention mechanism. In contrast to previous visual attention mechanisms, our method incorporates information from the answers when computing the attention weights, so the model can focus on the features that matter for the answer while processing the visual input. This further optimizes the attention mechanism and improves the accuracy of the model.
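To make the first contribution concrete, the following is a minimal PyTorch-style sketch of a dynamic-margin triplet loss over one question and two images. The function name, the cosine-distance formulation, and the specific margin heuristic (base margin scaled by how dissimilar the two images are) are illustrative assumptions, not the thesis's actual formulation.

```python
import torch
import torch.nn.functional as F


def dynamic_margin_triplet_loss(q_emb, img_pos, img_neg, base_margin=0.2):
    """Triplet loss whose margin adapts to how dissimilar the two images are.

    q_emb   : (B, D) embedding of the question (anchor)
    img_pos : (B, D) embedding of the image that matches the question
    img_neg : (B, D) embedding of the distractor image
    The margin heuristic below is an assumption made for illustration.
    """
    # Cosine distances between the question anchor and each image embedding.
    d_pos = 1.0 - F.cosine_similarity(q_emb, img_pos, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(q_emb, img_neg, dim=-1)

    # Dynamic margin: the further apart the two images are, the larger the margin.
    margin = base_margin * (2.0 - F.cosine_similarity(img_pos, img_neg, dim=-1))

    # Standard triplet hinge: keep the matching image closer than the distractor.
    return F.relu(d_pos - d_neg + margin).mean()


# Example: a batch of questions, each paired with a matching and a distractor image.
q, v_match, v_distract = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
loss = dynamic_margin_triplet_loss(q, v_match, v_distract)
```

In a full model this term would be added to the usual answer-classification loss, so that the multimodal embedding is pulled toward the image that actually supports the answer.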
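The second contribution can be read as answer-aware attention over region features. The sketch below shows one plausible interpretation: standard soft attention over detected regions, plus an auxiliary loss that pushes the attention distribution toward regions aligned with an answer embedding. The class name, dimensions, and the KL-based auxiliary term are assumptions for illustration, not the thesis's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnswerGuidedAttention(nn.Module):
    """Soft attention over image regions with an answer-guided auxiliary target."""

    def __init__(self, region_dim=2048, query_dim=1024, hidden=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden)   # project region features
        self.proj_q = nn.Linear(query_dim, hidden)    # project question features
        self.score = nn.Linear(hidden, 1)             # scalar attention score per region

    def forward(self, regions, question, answer=None):
        # regions: (B, K, region_dim); question: (B, query_dim);
        # answer (training only, assumed): (B, region_dim) answer embedding
        # already projected into the region feature space.
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        attn = F.softmax(self.score(joint).squeeze(-1), dim=-1)      # (B, K)
        fused = torch.bmm(attn.unsqueeze(1), regions).squeeze(1)     # (B, region_dim)

        aux_loss = regions.new_zeros(())
        if answer is not None:
            # Self-supervised target: regions whose features align with the
            # answer embedding should receive more attention weight.
            target = F.softmax(torch.bmm(regions, answer.unsqueeze(-1)).squeeze(-1), dim=-1)
            aux_loss = F.kl_div(attn.clamp_min(1e-8).log(), target, reduction="batchmean")
        return fused, attn, aux_loss
```

During training, aux_loss would be weighted and added to the answer-prediction loss; at inference the answer embedding is unavailable, so the auxiliary branch is simply skipped and only the attended visual feature is used.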
Keywords/Search Tags: Visual question answering, visual information, joint learning, self-supervision, attention mechanism