
Research on Video Question Answering Based on Deep Learning

Posted on: 2019-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2348330563953960
Subject: Computer software and theory
Abstract/Summary:
Recently, research at the intersection of computer vision and natural language processing has become popular, e.g., video captioning and Video Question Answering (Video-QA). In particular, Video-QA, as a newly proposed research domain, has attracted much attention from researchers. The goal of Video-QA is to predict an accurate answer given a video and a question. Video-QA tasks include single-word answers, open-ended answers, fill-in-the-blank, and multiple choice. In this thesis, we propose methods for the movie fill-in-the-blank task (MovieFIB). In this task, the question is a description with one missing word and the video is a movie clip; the goal is to accurately predict the missing word in the description.

Among previous works on the MovieFIB task, the earliest method does not model the relationship between the video and the question: it simply fuses the question and video features and uses the fused feature to predict the missing word. A later method applies an attention mechanism, but it only assigns weights to the video frames to obtain a question-level attention vector, which is then used for prediction. To overcome these limitations, we propose two models: a hierarchical multi-level attention model and a hierarchical multi-level multi-modal attention model. The first model uses both word-level and question-level attention. The word-level attention updates the video content so that it focuses on the content that is significant to both the video and the question, while the question-level attention updates the question representation in order to make an accurate prediction. The second model adds frame-level and video-level attention to the hierarchical multi-level attention model. The frame-level attention transforms the original description into a new description that emphasizes important parts of the video, and the video-level attention generates a better video feature. We then use an attention mechanism to fuse the video-level and question-level attention to predict the blank word.

We analyze the questions and answers in the dataset and observe that not every answer prediction requires visual content. For example, consider the description “she sits on the chair”. If the blank word is “on”, we can predict it using semantic information alone; if the missing word is “chair”, we must use visual information to avoid mistakes. Based on this observation, we propose a new model called adaptive temporal attention with description update. The adaptive temporal attention automatically assigns weights to the visual and semantic information, controlling how much visual information is used, while the description update rewrites the original description into a new one that highlights important information from the video. In this model, the question states from the text before and after the blank word are combined by concatenation. Because these two kinds of text differ, they should not be treated equally; we therefore propose a further model that uses an attention mechanism to fuse the two question states.

We conduct extensive experiments to evaluate the proposed methods on the MovieFIB dataset. The experimental results demonstrate that our methods outperform previous works.
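As a rough illustration of the hierarchical multi-level attention idea (word-level attention over video frames, followed by question-level attention over the word states), the PyTorch sketch below shows one possible reading. All module and variable names (HierarchicalAttentionSketch, q_words, v_frames) and the dimensions are assumptions for illustration, not the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttentionSketch(nn.Module):
    """Minimal sketch: word-level attention over frames, then
    question-level attention over words (names/dims are assumptions)."""
    def __init__(self, q_dim=300, v_dim=512, hid_dim=256):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.v_proj = nn.Linear(v_dim, hid_dim)
        self.frame_score = nn.Linear(hid_dim, 1)        # scores each frame per word
        self.word_score = nn.Linear(q_dim + v_dim, 1)   # scores each word

    def forward(self, q_words, v_frames):
        # q_words:  (B, Tq, q_dim)  word-level question states
        # v_frames: (B, Tv, v_dim)  per-frame video features
        q = self.q_proj(q_words).unsqueeze(2)                 # (B, Tq, 1, H)
        v = self.v_proj(v_frames).unsqueeze(1)                # (B, 1, Tv, H)
        e = self.frame_score(torch.tanh(q + v)).squeeze(-1)   # (B, Tq, Tv)
        frame_w = F.softmax(e, dim=-1)                        # frame weights per word
        v_per_word = torch.matmul(frame_w, v_frames)          # (B, Tq, v_dim)

        # Question-level attention pools the video-augmented word states.
        fused = torch.cat([q_words, v_per_word], dim=-1)      # (B, Tq, q_dim + v_dim)
        word_w = F.softmax(self.word_score(fused), dim=1)     # (B, Tq, 1)
        return (word_w * fused).sum(dim=1)                    # (B, q_dim + v_dim)
```

The pooled vector would then feed a classifier over the answer vocabulary; the exact fusion and classifier design in the thesis may differ.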
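The adaptive temporal attention can be pictured as a learned gate that decides, per prediction, how much the visual context contributes relative to the purely textual (semantic) context. Below is a minimal sketch under the assumption that both contexts have already been reduced to fixed-size vectors; the module and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class AdaptiveVisualGate(nn.Module):
    """Sketch of an adaptive weighting between visual and semantic evidence
    (one hypothetical reading of adaptive temporal attention, not the thesis code)."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, visual_ctx, semantic_ctx):
        # visual_ctx, semantic_ctx: (B, dim)
        beta = torch.sigmoid(self.gate(torch.cat([visual_ctx, semantic_ctx], dim=-1)))
        # beta near 1: rely on the video (e.g., predicting "chair");
        # beta near 0: rely on the text alone (e.g., predicting "on").
        return beta * visual_ctx + (1.0 - beta) * semantic_ctx
```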
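Finally, the attention-based fusion of the two question states (the text before and the text after the blank) can be sketched as a learned weighting over the two summary vectors in place of plain concatenation. Names and dimensions below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlankContextFusion(nn.Module):
    """Sketch: attention-weighted fusion of the states summarizing the text
    before and after the blank (hypothetical names; not the thesis code)."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h_before, h_after):
        # h_before, h_after: (B, dim) summaries of the left/right context of the blank
        states = torch.stack([h_before, h_after], dim=1)   # (B, 2, dim)
        w = F.softmax(self.score(states), dim=1)           # (B, 2, 1) learned weights
        return (w * states).sum(dim=1)                      # (B, dim) fused question state
```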
Keywords/Search Tags:Video Question Answering, Movie Fill-in-the-blank, Adaptive Temporal Attention Mechanism, Multi-level Attention Mechanism, Multi-modal Attention Mechanism