
Exploring Multi-Step Reasoning And Visual Localization In Video Question Answering

Posted on: 2020-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: X M Song
Full Text: PDF
GTID: 2518306518963039
Subject: Computer Science and Technology
Abstract/Summary:
Video question answering (VideoQA) is a popular research topic in the multimedia and computer vision fields. The task aims to answer natural language questions based on given video content. It is a multi-modal reasoning task that requires QA models to understand both visual and semantic information and to fuse the two modalities for reasoning. In recent years, some researchers have begun to study more complex VideoQA tasks and methods. One is multi-step reasoning, in which the QA model must solve problems involving multiple logical operations, such as mathematical operations and attribute comparison. The other is the combination of VideoQA with visual localization, in which the QA model must locate the relevant video segment while giving the answer. This thesis explores multi-step reasoning and visual localization tasks and methods in VideoQA.

In deep learning, a reasonable and sufficiently large amount of data is essential for research and experiments on a task. This thesis therefore constructs datasets for the two tasks under study. For multi-step reasoning in VideoQA, we automatically synthesize the dataset SVQA (Synthetic Video Question Answering); its questions involve a variety of logical operations, covering rich and complex spatial relations and temporal action relations among objects in the videos. For the visual localization task in VideoQA, we manually annotate the real-scene dataset Activity-QA; each QA pair in the dataset carries a timestamp indicating the video segment to which the pair is related.

In addition, this thesis proposes the VQA model and the VQA-VE model based on deep learning. These models not only encode the video and the question, but also use various attention mechanisms to fuse and reason over multimodal information. The VQA model uses a spatial attention mechanism together with ta-GRU, a GRU combined with a temporal attention mechanism, to conduct multi-step reasoning, while the VQA-VE model uses visual and semantic attention mechanisms to establish the correlation between question semantics and video clips. The experimental results not only show the effectiveness of the two models on their respective tasks, but also show that a VideoQA model with multi-step reasoning and visual localization abilities can further improve performance on traditional VideoQA tasks.
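The question-conditioned temporal attention that underlies models of this kind can be illustrated with a minimal sketch. This is a generic formulation only, not the thesis's actual ta-GRU or VQA-VE architecture; the function name `temporal_attention` and the toy feature dimensions are assumptions made for illustration. Per-frame visual features are scored against a question embedding, the scores are normalized with a softmax over time, and the frames are combined into a single attended context vector.

```python
import numpy as np

def temporal_attention(frame_feats, question_vec):
    """Weight per-frame video features by their relevance to the question.

    frame_feats: (T, d) array, one feature vector per video frame.
    question_vec: (d,) question embedding.
    Returns (context, weights): the attended (d,) context vector and
    the (T,) attention distribution over frames.
    """
    scores = frame_feats @ question_vec               # (T,) dot-product relevance
    scores = scores - scores.max()                    # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the time axis
    context = weights @ frame_feats                   # (d,) attention-weighted sum of frames
    return context, weights

# Toy example: 4 frames with 3-dimensional features
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))
q = rng.normal(size=3)
ctx, w = temporal_attention(V, q)
```

In a full model the attended context would be fed back into a recurrent update (as in a temporal-attention GRU) and the scoring would use learned projections rather than a raw dot product; the sketch only shows the attention step itself.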
Keywords/Search Tags:Video Question Answering, Multi-step Reasoning, Visual Localization, Deep Learning, Dataset, Attention Mechanism