Font Size: a A A

Video Question Answering Based On A Forget Memory Network

Posted on:2019-09-17Degree:MasterType:Thesis
Country:ChinaCandidate:Y Y GeFull Text:PDF
GTID:2428330593951035Subject:Pattern Recognition and Intelligent Systems
Abstract/Summary:PDF Full Text Request
In recent years,with the in-depth study of deep learning techniques,computer vision and natural language processing areas have made great progress.In computer vision area,image classification,object detection,action recognition and video classification have got good performance in some open relevant datasets.In natural language processing area,text classification,language model,speech recognition and machine translation also have achieved good performance.And they have been put into use.The advent of the age of artificial intelligence needs the machine have the cognitive ability.The machine which has the cognitive ability not only has to have good visual recognition ability,but also they have to recognize some natural language just like human.Video question answering task has received much attention in recent years.It combines the fields of computer vision and natural language processing and can answer some relevant questions according to the visual content of the video clips.Due to the complexities of the video data,only a few methods have been proposed for video question answering.Compared to image question answering task,video question answering task faces more challenges.It needs to exploit the temporal information of the video clips.What's more,most regions of the video frame are not relevant to the question.These irrelevant regions can be considered as noise.How to find the useful region features for the question,forget the irrelevant region features,fuse the video and text features,and exploit the temporal information of the video are problems needed to be solved.To solve these problems,we propose a forget memory network to solve video question answering task.For a sequence of video frames extracted from the video clip,we use convolutional neural networks to extract the video frame features.Then according to the relevant question,we use the forget memory network to select the useful region features for the question,and forget the irrelevant region features.If the video clips have some text descriptions,we can use the forget memory to process text information.Then the video features and text features are fused together.We use the fused video and text features to solve the video question answering.We also extend the forget memory network.When we use the forget memory network get video frame features,we input a sequence of frame features into a gated recurrent unit model.And we use the output features of the last timestep to represent the video features.It can get more spatial information of the video.Our proposed approaches achieve good performance on the MovieQA[1]and TACoS[2]datasets.
Keywords/Search Tags:Image Question Answering, Video Question Answering, Convolutional Neural Network, Forget Memory Network
PDF Full Text Request
Related items