
Automatic Captioning Of Single-event Videos Under Deep Learning Framework

Posted on: 2018-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: G Li
Full Text: PDF
GTID: 2348330542979621
Subject: Computer Science and Technology
Abstract/Summary:
Automatically generating natural language descriptions for open-domain videos has attracted growing interest recently, because it combines two major fields of artificial intelligence: computer vision and natural language processing. Promising progress has been made owing to breakthroughs in deep neural networks. In contrast to traditional SVO (subject, verb, object) based methods, this thesis proposes two novel frameworks for video captioning via deep neural networks.

In the first framework, visual features are extracted from each frame by a fine-tuned deep Convolutional Neural Network (CNN) and fed into a Recurrent Neural Network (RNN), which generates a candidate sentence description for each frame. To obtain the most representative, high-quality descriptions for the target video, a well-devised automatic summarization step reduces noise by ranking sentences on a sentence-sequence graph. Moreover, the framework can describe out-of-sample videos by transferring knowledge from pre-captioned images. Experiments on benchmark datasets demonstrate that the method outperforms state-of-the-art video captioning methods in both language-generation metrics and SVO accuracy.

The second framework aims to tackle the complex dynamics of open-domain videos, namely the variable numbers of frames and words. To approach this problem, a novel end-to-end sequence-to-sequence model is proposed to generate captions for videos. It exploits recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. The LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames with a sequence of words in order to describe the event in a video clip. The model naturally learns both the temporal structure of the frame sequence and a sequence model of the generated sentences, i.e. a language model. Several variants of the model that exploit different visual features are evaluated on a standard set of YouTube videos.
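The summarization step in the first framework, ranking per-frame sentences on a graph to pick the most representative descriptions, can be sketched as a graph-ranking (TextRank-style) procedure. This is a minimal illustration, not the thesis's actual implementation; the similarity measure, damping factor, and example captions are all assumptions.

```python
# Sketch of graph-based sentence ranking: score each candidate caption by
# power iteration on a sentence-similarity graph and keep the top sentence.
# All names and values here are illustrative.

def similarity(a, b):
    """Word-overlap similarity between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / (len(sa) ** 0.5 * len(sb) ** 0.5)

def rank_sentences(sentences, damping=0.85, iters=50):
    """Score sentences by power iteration on the similarity graph."""
    n = len(sentences)
    # Pairwise similarity matrix (no self-loops).
    w = [[similarity(si, sj) if i != j else 0.0
          for j, sj in enumerate(sentences)]
         for i, si in enumerate(sentences)]
    row_sums = [sum(row) or 1.0 for row in w]  # avoid division by zero
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(w[j][i] / row_sums[j] * scores[j]
                                  for j in range(n))
                  for i in range(n)]
    return scores

# Toy per-frame captions: the noisy outlier should rank lowest.
captions = [
    "a man is riding a bike",
    "a man rides a bicycle down the road",
    "a dog barks",
    "a man is riding a bicycle",
]
scores = rank_sentences(captions)
best = captions[max(range(len(captions)), key=scores.__getitem__)]
```

Sentences that agree with many other candidates accumulate score, so the noisy "a dog barks" caption is pushed to the bottom, which mirrors the noise-reduction role of the summarization step.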
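The second framework's key idea, a single recurrent model that first reads a variable-length frame sequence and then emits a variable-length word sequence from the same state, can be sketched as a two-stage unrolling schedule. The `step` function below is a stand-in for a real LSTM cell, and the toy decoder and feature values are purely illustrative assumptions, not the thesis's model.

```python
# Sketch of the sequence-to-sequence schedule: encode frames first, then
# switch the same recurrence to decoding words, so variable-length inputs
# and outputs share one hidden state. `step` is a placeholder for an LSTM
# cell; `decode` stands in for the word-prediction head.

def step(state, inp):
    """Placeholder recurrent cell: fold the input features into the state."""
    return state + sum(inp)

def caption(frames, decode, max_words=10, eos="<eos>"):
    # Encoding stage: consume per-frame features; no words are emitted yet.
    state = 0.0
    for f in frames:
        state = step(state, f)
    # Decoding stage: emit words conditioned on the state until <eos>.
    words = []
    for _ in range(max_words):
        word, state = decode(state)
        if word == eos:
            break
        words.append(word)
    return words

def make_toy_decoder(sentence):
    """Deterministic stand-in decoder that replays a fixed sentence."""
    it = iter(sentence + ["<eos>"])
    def decode(state):
        return next(it), state
    return decode

frames = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2]]   # toy per-frame CNN features
decoder = make_toy_decoder(["a", "man", "rides", "a", "bike"])
print(caption(frames, decoder))   # ['a', 'man', 'rides', 'a', 'bike']
```

Because both stages unroll the same recurrence, neither the number of frames nor the number of words needs to be fixed in advance, which is exactly the "complex dynamics" the second framework targets.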
Keywords/Search Tags:Video Captioning, Deep Learning, RNN, CNN