
Spatio-Temporal Attention Networks For Video Question Answering

Posted on: 2019-04-20  Degree: Master  Type: Thesis
Country: China  Candidate: Q F Yang  Full Text: PDF
GTID: 2428330548479802  Subject: Computer Science and Technology
Abstract/Summary:
Combining visual and natural-language resources has long been a key research topic in artificial intelligence, both for visual information retrieval and for building human-like smart agents. Video question answering is an essential problem for visual information retrieval sites and smart assistant products: when a user asks a question, the system automatically returns the most relevant answers or generates a natural-language answer from the related visual information. Many existing works treat open-ended video question answering as a multimodal representation learning and understanding task. However, most of them focus on question answering over static visual content and may lose effectiveness when extended to video, because they cannot model the temporally sensitive information in video content. In this paper, we use a spatio-temporal attention network within the classical encoder-decoder learning framework to learn a joint representation of variable-length dynamic video and question content, and then generate open-ended answers to tackle the video question answering problem. Using a spatial attention network, the proposed method localizes the most significant regions in each frame given the question. A temporal attention mechanism likewise extracts the collective frame information across the entire video for question answering. We then adapt attentional gated recurrent unit (GRU) networks to learn the sequential order of video frames. To further improve the representation learning capability of the proposed method, we introduce a multi-reasoning process that iteratively updates the learned multimodal features. A large-scale video question answering dataset is constructed to show the advantages of the proposed method. Extensive experiments comparing our algorithm with cutting-edge enhanced visual question answering and video question answering approaches show that the proposed method is more effective under three major evaluation criteria. Furthermore, we implemented a fully functional video question answering system on the Facebook Messenger Platform to demonstrate the strength of our model. The implemented system is publicly accessible.
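
The two attention stages described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the dot-product scoring, the function names (`attend`, `spatio_temporal_attention`), and the toy feature vectors are all assumptions made for illustration, and the GRU encoder, the multi-reasoning updates, and the answer decoder are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(items, query):
    """Soft attention: score each item against the query (dot product is
    an assumed scoring function), then return the weighted average of the
    items together with the attention weights."""
    weights = softmax([dot(item, query) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items))
              for d in range(dim)]
    return pooled, weights


def spatio_temporal_attention(video, question):
    """video: list of frames, each frame a list of region feature vectors.
    question: a single question feature vector of the same dimension."""
    # Spatial stage: pool the regions of each frame, guided by the question.
    frame_vectors = [attend(regions, question)[0] for regions in video]
    # Temporal stage: pool the per-frame vectors across the whole video.
    video_vector, frame_weights = attend(frame_vectors, question)
    return video_vector, frame_weights


# Hypothetical toy input: 2 frames, 2 regions per frame, 2-d features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 0.5]],
]
question = [1.0, 0.0]
video_vector, frame_weights = spatio_temporal_attention(video, question)
```

In this toy input the second frame contains the region that best matches the question vector, so the temporal stage gives it the larger weight.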
Keywords/Search Tags: Video Question Answering, Representation Learning, Natural Language Understanding, Attention Mechanism