Spatio-Temporal Attention Networks For Video Question Answering

Posted on:2019-04-20

Degree:Master

Type:Thesis

Country:China

Candidate:Q F Yang

Full Text:PDF

GTID:2428330548479802

Subject:Computer Science and Technology

Abstract/Summary:

For visual information retrieval and for building human-like smart agents, combining visual and natural-language multimodal resources has long been a key research topic in artificial intelligence. Video question answering is an essential problem for both visual information retrieval sites and smart-assistant businesses: when a user asks a question, it lets them automatically return the most relevant answers or generate a natural-language answer from the related visual information. Many existing works treat open-ended video question answering as a multimodal representation learning and understanding task. However, most of these works focus on question answering over static visual content and may lose effectiveness when extended to video question answering, because they cannot model the temporally sensitive information in video content. In this paper, we use a spatio-temporal attention network under the classical encoder-decoder learning framework to learn a joint representation of variable-length dynamic video and question content, and then generate open-ended answers to tackle the video question answering problem. Using a spatial attention network, the proposed method localizes the most significant regions in each frame given the question. A temporal attention mechanism likewise extracts the collective frame information across the entire video for question answering. We then adapt attentional gated recurrent unit (GRU) networks to learn the sequential order of video frames. To further improve the representation learning capability of the proposed method, we introduce a multi-reasoning process that iteratively updates the learned multimodal features. We construct a large-scale video question answering dataset to show the advantages of the proposed method. Extensive experiments comparing our algorithm with state-of-the-art enhanced visual question answering and video question answering approaches show that the proposed method is markedly more effective on three major evaluation criteria. Furthermore, we implemented a fully functional video question answering system on the Facebook Messenger Platform to demonstrate the strength of our model. The implemented system is publicly accessible.
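
The two attention stages described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the dot-product scoring, the function names (`attend`, `spatio_temporal_attention`), and the toy feature vectors are all assumptions made for illustration, and the attentional GRU and the multi-reasoning updates are left out. Spatial attention pools the regions of each frame into one frame vector guided by the question, and temporal attention then pools the frame vectors across the video.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(items: List[Vector], query: Vector) -> Tuple[List[float], Vector]:
    """Return attention weights over items and their weighted sum.

    Assumption: relevance is scored with a plain dot product against the
    question vector; the thesis may use a different scoring function.
    """
    weights = softmax([dot(item, query) for item in items])
    dim = len(items[0])
    pooled = [sum(w * item[d] for w, item in zip(weights, items))
              for d in range(dim)]
    return weights, pooled


def spatio_temporal_attention(
    video: List[List[Vector]], question: Vector
) -> Tuple[List[float], Vector]:
    """video[t][r] is the feature vector of region r in frame t."""
    # Spatial attention: one question-guided vector per frame.
    frame_vectors = [attend(regions, question)[1] for regions in video]
    # Temporal attention: question-guided pooling across all frames.
    frame_weights, video_vector = attend(frame_vectors, question)
    return frame_weights, video_vector


# Toy input (hypothetical values): 3 frames, 2 regions each, 2-d features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
question = [1.0, 0.0]

frame_weights, video_vector = spatio_temporal_attention(video, question)
print(frame_weights, video_vector)
```

In this toy input the third frame contains the region that best matches the question vector, so it receives the largest temporal weight, while the all-zero second frame receives the smallest.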

Keywords/Search Tags:

Video Question Answering, Representation Learning, Natural Language Understanding, Attention Mechanism

Related items

1	Research On Visual Question Answering Method With Visual Content Understanding And Text Information Analysis
2	Research On Collaborative Attention Model And Deep Correlated Networks For Visual Question Answer
3	Research On Single-fact Knowledge Base Question Answering Based On Multi-aspect Attention Mechanism
4	Research On Attention Neural Network And Its Application In Natural Language Understanding
5	Research Of Question Answering Technology And Application Based On Natural Language Understanding
6	Research On Question Answering System Based On Understanding Of Chinese Natural Language
7	Research On Visual Question Answering Method Based On Attention Mechanism
8	A Research Of Video Question Answering Based On Deep Learning
9	Research On Question Answering System Based On Attention Mechanism And Answer Verification
10	Question Answering Model Based On Self-Attention Mechanism