Video Question Answering Based On Deep Attention And Deep Fusion

Posted on: 2021-02-13  Degree: Master  Type: Thesis
Country: China  Candidate: M Zhang
GTID: 2428330623469128  Subject: Computer Science and Technology
Abstract/Summary:
Data increasingly appear in unstructured form, and video has become the main carrier of information. Automatically analyzing massive amounts of video and extracting useful information from it is very challenging. Video question answering, as one of the most measurable directions for visual semantic understanding, has received widespread attention from researchers. Its goal is to understand a given video and produce a reasonable answer to a natural language question about it, which can offer substantial help for the visual semantic understanding of massive video collections. The current mainstream methods for video question answering use deep neural networks, whose basic components include convolutional neural networks, recurrent neural networks, and attention mechanisms. However, existing models cannot make full use of text information: their attention mechanisms lean toward video features, and their fusion mechanisms cannot fully fuse the multimodal features of text and video. This thesis proposes a video question answering model based on deep attention and deep fusion. Deep attention is a new method of attention calculation that constructs a frame-word-level attention map with multiple glimpses, efficiently obtaining the correlation weights between frames and words while greatly reducing the number of parameters required to build the attention map. Deep fusion is a multi-output model structure based on residual learning. It makes more effective use of the attention map information from the multiple glimpses and introduces a Refine module that continuously optimizes the fused features so that the information they carry is more targeted. We conducted many comparative experiments on three challenging datasets to verify the effectiveness of the model. The experimental results show that the proposed algorithm better solves the problem of video and text fusion and that its performance is significantly improved on all three datasets.
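The abstract does not give the formulas behind deep attention or deep fusion, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes a low-rank bilinear form for the frame-word scores, in which all glimpses share two projection matrices (`U`, `V`) and each glimpse adds only a small vector, and it assumes a simple residual accumulation that emits one fused feature per glimpse. The Refine module is not modeled. All names, shapes, and the scoring form are hypothetical.

```python
import math
import random

def project(vec, mat):
    """Return mat^T @ vec, where mat is a list of rows with shape len(vec) x K."""
    k = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(k)]

def attention_maps(frames, words, U, V, glimpse_vectors):
    """Build one frame-by-word attention map per glimpse (assumed formulation)."""
    pf = [project(f, U) for f in frames]  # T x K projected frame features
    pw = [project(w, V) for w in words]   # N x K projected word features
    maps = []
    for p in glimpse_vectors:
        # Low-rank bilinear score for every (frame, word) pair.
        scores = [[sum(p[k] * a[k] * b[k] for k in range(len(p))) for b in pw]
                  for a in pf]
        # Softmax over all frame-word pairs, shifted by the max for stability.
        peak = max(max(row) for row in scores)
        exps = [[math.exp(s - peak) for s in row] for row in scores]
        total = sum(sum(row) for row in exps)
        maps.append([[e / total for e in row] for row in exps])
    return maps, pf, pw

def residual_fusion(maps, pf, pw):
    """Add each glimpse's attended feature to a running sum; keep every stage."""
    k = len(pf[0])
    fused = [0.0] * k
    outputs = []
    for amap in maps:
        update = [sum(amap[t][n] * pf[t][j] * pw[n][j]
                      for t in range(len(pf)) for n in range(len(pw)))
                  for j in range(k)]
        fused = [f + u for f, u in zip(fused, update)]  # residual connection
        outputs.append(list(fused))                     # one output per glimpse
    return outputs

# Toy inputs: 4 frames (dim 6), 3 words (dim 5), rank K = 8, 2 glimpses.
random.seed(0)
T, N, DV, DW, K, G = 4, 3, 6, 5, 8, 2
rand = lambda rows, cols: [[random.gauss(0, 1) for _ in range(cols)]
                           for _ in range(rows)]
frames, words = rand(T, DV), rand(N, DW)
U, V, glimpse_vectors = rand(DV, K), rand(DW, K), rand(G, K)

maps, pf, pw = attention_maps(frames, words, U, V, glimpse_vectors)
outputs = residual_fusion(maps, pf, pw)
```

Under these assumptions, each extra glimpse costs only `K` additional parameters because `U` and `V` are shared, which is one way a multi-glimpse attention map could stay small. Whether the thesis achieves its parameter reduction this way is not stated in the abstract.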
Keywords/Search Tags: Video Question Answering, Attention Mechanism, CNN, RNN