
Video Question Answering Based On Deep Learning

Posted on: 2020-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: L L Liang
Full Text: PDF
GTID: 2428330575491185
Subject: Computer Science and Technology
Abstract/Summary:
Given a short video and a natural-language question, a video question answering (Video QA) system must produce an answer that requires understanding both the video content and the question. As Video QA has become a focus of computer vision and natural language processing research in recent years, attention mechanisms have emerged as one of the main approaches to the task. However, current methods have three shortcomings. First, they describe a video using only frame features, neglecting the temporal information in the video. Second, questions are not preprocessed before model training, even though the stop-words they contain do not describe the video. Third, the complexity and logical structure of the Video QA task, which requires repeated attention and reasoning, are not considered. These shortcomings reduce the generalization performance and accuracy of the models.

This paper proposes a deep-learning-based Multi-Stage Attention Mechanism (MSAM) model consisting of three stages. The first stage applies attention in the temporal dimension: the video sequence is processed, and the key video segments related to the question, including key frames and key clips, are identified according to the sequence of attended words. The second stage applies attention in the spatial dimension: guided by the question with stop-words filtered out, the attention mechanism focuses on the key frames or key clips and highlights the regions related to the question. The third stage applies attention jointly in the temporal and spatial dimensions: the question information carried in the hidden state of an LSTM is fused with the features of the key video segments and fed into a Bi-LSTM network, whose output performs a third attention operation over the intermediate video representation. The result of this attention is used to evaluate the validity of the region information from the second stage, yielding the information most significant for answering the question.

Building on MSAM, this paper proposes a Video QA method based on a Multi-Stage Attention Mechanism Network (MSAMN). First, ResNet is used as the frame-level feature extractor and TSN as the clip-level feature extractor. Second, the question and the answer tags are encoded with an LSTM and a CNN, respectively. Third, the video and question features are fed into the MSAM model, which outputs the features most relevant to predicting the answer. Finally, the predicted answer is generated through feature fusion and multi-step reasoning. The MSAMN model is evaluated on the Tianchi ZJL open dataset, where it achieves a high accuracy rate.
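The question-guided temporal attention in the first stage can be sketched as follows. This is a minimal NumPy illustration, not the thesis's exact formulation: the dot-product relevance scoring, the feature dimensions, and the function names are assumptions made for clarity.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(frame_feats, question_vec):
    """Weight per-frame video features by their relevance to the question.

    frame_feats:  (T, d) array, one feature vector per frame (e.g. from ResNet)
    question_vec: (d,) encoded question (e.g. an LSTM hidden state)
    Returns the attended video representation (d,) and the weights (T,).
    """
    scores = frame_feats @ question_vec   # assumed dot-product relevance per frame
    weights = softmax(scores)             # normalize scores into attention weights
    attended = weights @ frame_feats      # weighted sum of frame features over time
    return attended, weights

# Toy example: 4 frames with 3-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 3))
question = rng.normal(size=3)
video_repr, w = temporal_attention(frames, question)
```

Frames whose features align with the question vector receive larger weights, so the attended representation emphasizes the key frames; the same pattern, applied over spatial regions and then over the fused Bi-LSTM output, would give the second and third attention stages.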
Keywords/Search Tags: video question answering, deep learning, feature fusion, multi-stage attention mechanism