
Video Question Answering Based On Spatio-temporal Attention Model

Posted on: 2019-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: K Gao
Full Text: PDF
GTID: 2428330593451012
Subject: Computer Science and Technology
Abstract/Summary:
In the fields of computer vision and multimedia analysis, video analysis is an important and challenging task. Video Question Answering (VQA), regarded as a medium of video analysis, has attracted much attention in recent years. Recent developments in deep learning have achieved success in many visual and natural language processing (NLP) tasks. Deep convolutional features have shown strong ability in several visual tasks, and recurrent neural networks (RNNs), particularly the LSTM, are widely used in NLP to handle sequence problems. Recently, more and more researchers have turned to deep understanding of visual content by jointly modeling visual and language information. VQA is the task of producing a suitable answer to a question about a given video by combining the video's visual and semantic information. Compared with images, videos extend along the time dimension. Because of the sequential cues carried by consecutive frames, VQA faces many challenges, and research on VQA is still relatively rare.

Inspired by work on visual description and image question answering (IQA), in this paper we propose two frameworks based on deep learning. In the first study, we propose spatio-temporal context networks for VQA; in the second, we propose initialized frame attention networks for VQA.

In detail, the first framework is composed of two components: an encoder and a decoder. For the encoder, on the one hand, we generate a scene feature with our designed scene model; on the other hand, a motion model focuses on extracting temporal information from consecutive frames. We utilize optical flow as a motion weight to emphasize the regions of action variation in videos. Next, we generate the scene representation and the motion representation with two LSTM structures. For the decoder, we concatenate the scene and motion representations to initialize the language model. Finally, a fully connected layer aggregates the semantic descriptions, and a softmax classifier chooses the best answer.

The second framework also consists of an encoder and a decoder. First, we generate frame feature vectors with our designed frame model. In the encoder, an LSTM encodes these frame features into an overall visual representation, which is used to initialize the language model of the decoder. Meanwhile, in the decoder, we integrate a frame attention mechanism to decode the visual information of question-related frames. Finally, a softmax classifier chooses the best answer.

For both proposed frameworks, we conduct experiments on public datasets and achieve good performance.
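The frame attention step of the second framework can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it assumes dot-product scoring between the decoder's hidden state (the query) and each per-frame feature vector, followed by a softmax over frames to produce a question-conditioned visual context; the function names `softmax` and `frame_attention` are illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scalar scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frame_attention(frame_feats, query):
    """Attend over per-frame feature vectors given a decoder query.

    frame_feats: list of T feature vectors (lists of floats)
    query: decoder hidden state (list of floats, same dimension)
    Returns (context_vector, attention_weights).
    """
    # Relevance score of each frame: dot product with the query.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in frame_feats]
    # Normalize scores into attention weights over the T frames.
    weights = softmax(scores)
    dim = len(frame_feats[0])
    # Weighted sum of frame features: the question-conditioned visual context.
    context = [sum(w * feat[d] for w, feat in zip(weights, frame_feats))
               for d in range(dim)]
    return context, weights
```

At each decoding step, the context vector would be fed back into the language model so that question-related frames contribute more to the answer than irrelevant ones.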
Keywords/Search Tags:Video Question Answering, Deep Learning, LSTM, CNN