
Research On Visual Information Enhancement For Visual Question Answering

Posted on: 2021-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: X M Li
Full Text: PDF
GTID: 2518306017474714
Subject: Computer technology
Abstract/Summary:
Visual Question Answering (VQA) is a deep learning task in which a model automatically answers questions posed about a given image or video. Because the questions can cover a wide range of topics and take many forms, VQA plays a pivotal role in both scientific research and industrial development. A VQA model must consider textual and visual information jointly: it predicts the answer by reasoning over features extracted from both the question text and the image. Although the underlying language and vision sub-tasks have advanced greatly in their respective fields over the past few years, VQA, as an interdisciplinary task, has improved only gradually, and effectively increasing the saliency of visual features remains an open problem. This thesis therefore focuses on how to obtain effective visual features and strengthen the role of visual information in answer prediction. The problem is explored from two perspectives.

First, because the data distribution of VQA datasets is uneven, existing methods rely heavily on linguistic priors to predict answers, which not only limits the answering ability of the model but also reduces the reliability of the predicted answers. To counter this imbalance between textual and visual information, we propose a joint learning strategy that increases the contribution of visual features to the prediction model. The strategy combines a triplet loss with dynamic margins and multimodal embedding learning: by jointly feeding the model "one question and two images", it is forced to attend to the distinct visual features of each image. This raises the proportion of visual information used in answer prediction and improves both the accuracy of the predicted answers and the reliability of the model.

Second, most existing VQA models extract question-relevant visual information through a visual attention mechanism, but current attention mechanisms lack direct supervisory signals, which limits the accuracy of the extracted visual features. To supervise attention learning in VQA, we propose a visual feature optimization method based on a self-supervised attention mechanism. In contrast to previous visual attention mechanisms, our method incorporates information from the answers when computing the attention weights, so the model can focus on the features that matter for the answer while processing the visual input. This further optimizes the attention mechanism and improves the accuracy of the model.
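To make the first contribution concrete, the following is a minimal PyTorch-style sketch of a dynamic-margin triplet loss over one question and two images. The function name, the cosine-distance formulation, and the specific margin heuristic (base margin scaled by how dissimilar the two images are) are illustrative assumptions, not the thesis's actual formulation.

```python
import torch
import torch.nn.functional as F


def dynamic_margin_triplet_loss(q_emb, img_pos, img_neg, base_margin=0.2):
    """Triplet loss whose margin adapts to how dissimilar the two images are.

    q_emb   : (B, D) embedding of the question (anchor)
    img_pos : (B, D) embedding of the image that matches the question
    img_neg : (B, D) embedding of the distractor image
    The margin heuristic below is an assumption made for illustration.
    """
    # Cosine distances between the question anchor and each image embedding.
    d_pos = 1.0 - F.cosine_similarity(q_emb, img_pos, dim=-1)
    d_neg = 1.0 - F.cosine_similarity(q_emb, img_neg, dim=-1)

    # Dynamic margin: the further apart the two images are, the larger the margin.
    margin = base_margin * (2.0 - F.cosine_similarity(img_pos, img_neg, dim=-1))

    # Standard triplet hinge: keep the matching image closer than the distractor.
    return F.relu(d_pos - d_neg + margin).mean()


# Example: a batch of questions, each paired with a matching and a distractor image.
q, v_match, v_distract = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
loss = dynamic_margin_triplet_loss(q, v_match, v_distract)
```

In a full model this term would be added to the usual answer-classification loss, so that the multimodal embedding is pulled toward the image that actually supports the answer.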
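The second contribution can be read as answer-aware attention over region features. The sketch below shows one plausible interpretation: standard soft attention over detected regions, plus an auxiliary loss that pushes the attention distribution toward regions aligned with an answer embedding. The class name, dimensions, and the KL-based auxiliary term are assumptions for illustration, not the thesis's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnswerGuidedAttention(nn.Module):
    """Soft attention over image regions with an answer-guided auxiliary target."""

    def __init__(self, region_dim=2048, query_dim=1024, hidden=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden)   # project region features
        self.proj_q = nn.Linear(query_dim, hidden)    # project question features
        self.score = nn.Linear(hidden, 1)             # scalar attention score per region

    def forward(self, regions, question, answer=None):
        # regions: (B, K, region_dim); question: (B, query_dim);
        # answer (training only, assumed): (B, region_dim) answer embedding
        # already projected into the region feature space.
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        attn = F.softmax(self.score(joint).squeeze(-1), dim=-1)      # (B, K)
        fused = torch.bmm(attn.unsqueeze(1), regions).squeeze(1)     # (B, region_dim)

        aux_loss = regions.new_zeros(())
        if answer is not None:
            # Self-supervised target: regions whose features align with the
            # answer embedding should receive more attention weight.
            target = F.softmax(torch.bmm(regions, answer.unsqueeze(-1)).squeeze(-1), dim=-1)
            aux_loss = F.kl_div(attn.clamp_min(1e-8).log(), target, reduction="batchmean")
        return fused, attn, aux_loss
```

During training, aux_loss would be weighted and added to the answer-prediction loss; at inference the answer embedding is unavailable, so the auxiliary branch is simply skipped and only the attended visual feature is used.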
Keywords/Search Tags: Visual question answering, visual information, joint learning, self-supervision, attention mechanism