
Exploring Multi-Step Reasoning And Visual Localization In Video Question Answering

Posted on: 2020-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: X M Song
Full Text: PDF
GTID: 2518306518963039
Subject: Computer Science and Technology
Abstract/Summary:
Video question answering (VideoQA) is a popular research topic in the multimedia and computer vision fields. The task aims to answer natural language questions based on given video content. It is a multi-modal reasoning task that requires QA models to understand both visual and semantic information and to fuse the two modalities for reasoning. In recent years, some researchers have begun to study more complex VideoQA tasks and methods. One is multi-step reasoning, in which the QA model must solve problems involving multiple logical operations, such as mathematical operations and attribute comparison. The other is the combination of VideoQA with visual localization, in which the QA model must locate the relevant video segment while giving the answer. This thesis explores multi-step reasoning and visual localization tasks and methods in VideoQA.

In deep learning, a reasonable and sufficiently large amount of data is essential for research and experiments on a task. This thesis therefore constructs datasets for the two tasks under study. For multi-step reasoning in VideoQA, we automatically synthesize the dataset SVQA (Synthetic Video Question Answering); its questions involve a variety of logical operations, covering rich and complex spatial relations and temporal action relations among objects in the videos. For the visual localization task in VideoQA, we manually annotate the real-scene dataset Activity-QA; each QA pair in the dataset carries a timestamp indicating the video segment to which the pair is related.

In addition, this thesis proposes the VQA model and the VQA-VE model based on deep learning. These models not only encode the video and the question, but also use various attention mechanisms to fuse and reason over multimodal information. The VQA model uses a spatial attention mechanism together with ta-GRU, a GRU combined with a temporal attention mechanism, to conduct multi-step reasoning, while the VQA-VE model uses visual and semantic attention mechanisms to establish the correlation between question semantics and video clips. The experimental results not only show the effectiveness of the two models on their respective tasks, but also show that a VideoQA model with multi-step reasoning and visual localization abilities can further improve performance on traditional VideoQA tasks.
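The question-conditioned temporal attention that underlies models of this kind can be illustrated with a minimal sketch. This is a generic formulation only, not the thesis's actual ta-GRU or VQA-VE architecture; the function name `temporal_attention` and the toy feature dimensions are assumptions made for illustration. Per-frame visual features are scored against a question embedding, the scores are normalized with a softmax over time, and the frames are combined into a single attended context vector.

```python
import numpy as np

def temporal_attention(frame_feats, question_vec):
    """Weight per-frame video features by their relevance to the question.

    frame_feats: (T, d) array, one feature vector per video frame.
    question_vec: (d,) question embedding.
    Returns (context, weights): the attended (d,) context vector and
    the (T,) attention distribution over frames.
    """
    scores = frame_feats @ question_vec               # (T,) dot-product relevance
    scores = scores - scores.max()                    # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the time axis
    context = weights @ frame_feats                   # (d,) attention-weighted sum of frames
    return context, weights

# Toy example: 4 frames with 3-dimensional features
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))
q = rng.normal(size=3)
ctx, w = temporal_attention(V, q)
```

In a full model the attended context would be fed back into a recurrent update (as in a temporal-attention GRU) and the scoring would use learned projections rather than a raw dot product; the sketch only shows the attention step itself.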
Keywords/Search Tags:Video Question Answering, Multi-step Reasoning, Visual Localization, Deep Learning, Dataset, Attention Mechanism