
Video Question Answering Based On Spatio-temporal Attention Model

Posted on: 2019-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: K Gao
Full Text: PDF
GTID: 2428330593451012
Subject: Computer Science and Technology
Abstract/Summary:
In the fields of computer vision and multimedia analysis, video analysis is an important and challenging task. Video Question Answering (VQA), regarded as a medium of video analysis, has attracted much attention in recent years. Recent developments in deep learning have achieved success in many visual and natural language processing (NLP) tasks. Deep convolutional features have shown strong ability in several visual tasks, and recurrent neural networks (RNNs), particularly the LSTM, are widely used in NLP to handle sequence problems. Recently, more and more researchers have turned to deep understanding of visual content by jointly modeling visual and language information. VQA is the task of producing a suitable answer to a question about a given video by combining the video's visual and semantic information. Compared with images, videos extend along the time dimension. Because of the sequential cues carried by consecutive frames, VQA faces many challenges, and research on VQA is still relatively rare.

Inspired by work on visual description and image question answering (IQA), in this paper we propose two frameworks based on deep learning. In the first study, we propose spatio-temporal context networks for VQA; in the second, we propose initialized frame attention networks for VQA.

In detail, the first framework is composed of two components: an encoder and a decoder. For the encoder, on the one hand, we generate a scene feature with our designed scene model; on the other hand, a motion model focuses on extracting temporal information from consecutive frames. We utilize optical flow as a motion weight to emphasize the regions of action variation in videos. Next, we generate the scene representation and the motion representation with two LSTM structures. For the decoder, we concatenate the scene and motion representations to initialize the language model. Finally, a fully connected layer aggregates the semantic descriptions, and a softmax classifier chooses the best answer.

The second framework also consists of an encoder and a decoder. First, we generate frame feature vectors with our designed frame model. In the encoder, an LSTM encodes these frame features into an overall visual representation, which is used to initialize the language model of the decoder. Meanwhile, in the decoder, we integrate a frame attention mechanism to decode the visual information of question-related frames. Finally, a softmax classifier chooses the best answer.

For both proposed frameworks, we conduct experiments on public datasets and achieve good performance.
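The frame attention step of the second framework can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it assumes dot-product scoring between the decoder's hidden state (the query) and each per-frame feature vector, followed by a softmax over frames to produce a question-conditioned visual context; the function names `softmax` and `frame_attention` are illustrative.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scalar scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frame_attention(frame_feats, query):
    """Attend over per-frame feature vectors given a decoder query.

    frame_feats: list of T feature vectors (lists of floats)
    query: decoder hidden state (list of floats, same dimension)
    Returns (context_vector, attention_weights).
    """
    # Relevance score of each frame: dot product with the query.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in frame_feats]
    # Normalize scores into attention weights over the T frames.
    weights = softmax(scores)
    dim = len(frame_feats[0])
    # Weighted sum of frame features: the question-conditioned visual context.
    context = [sum(w * feat[d] for w, feat in zip(weights, frame_feats))
               for d in range(dim)]
    return context, weights
```

At each decoding step, the context vector would be fed back into the language model so that question-related frames contribute more to the answer than irrelevant ones.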
Keywords/Search Tags:Video Question Answering, Deep Learning, LSTM, CNN