
Automatic Captioning Of Single-event Videos Under Deep Learning Framework

Posted on: 2018-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: G Li
Full Text: PDF
GTID: 2348330542979621
Subject: Computer Science and Technology
Abstract/Summary:
Automatically generating natural language descriptions for open-domain videos has attracted growing interest recently, because it combines two major fields of artificial intelligence: computer vision and natural language processing. Promising progress has been made owing to breakthroughs in deep neural networks. In contrast to traditional SVO (subject, verb, object) based methods, this thesis proposes two novel frameworks for video captioning via deep neural networks.

In the first framework, visual features are extracted from each frame by a fine-tuned deep Convolutional Neural Network (CNN) and fed into a Recurrent Neural Network (RNN), which generates a candidate sentence description for each frame. To obtain the most representative, high-quality descriptions for the target video, a well-devised automatic summarization step reduces noise by ranking sentences on a sentence-sequence graph. Moreover, the framework can describe out-of-sample videos by transferring knowledge from pre-captioned images. Experiments on benchmark datasets demonstrate that the method outperforms state-of-the-art video captioning methods in both language-generation metrics and SVO accuracy.

The second framework aims to tackle the complex dynamics of open-domain videos, namely the variable numbers of frames and words. To approach this problem, a novel end-to-end sequence-to-sequence model is proposed to generate captions for videos. It exploits recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. The LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames with a sequence of words in order to describe the event in a video clip. The model naturally learns both the temporal structure of the frame sequence and a sequence model of the generated sentences, i.e. a language model. Several variants of the model that exploit different visual features are evaluated on a standard set of YouTube videos.
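The summarization step in the first framework, ranking per-frame sentences on a graph to pick the most representative descriptions, can be sketched as a graph-ranking (TextRank-style) procedure. This is a minimal illustration, not the thesis's actual implementation; the similarity measure, damping factor, and example captions are all assumptions.

```python
# Sketch of graph-based sentence ranking: score each candidate caption by
# power iteration on a sentence-similarity graph and keep the top sentence.
# All names and values here are illustrative.

def similarity(a, b):
    """Word-overlap similarity between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / (len(sa) ** 0.5 * len(sb) ** 0.5)

def rank_sentences(sentences, damping=0.85, iters=50):
    """Score sentences by power iteration on the similarity graph."""
    n = len(sentences)
    # Pairwise similarity matrix (no self-loops).
    w = [[similarity(si, sj) if i != j else 0.0
          for j, sj in enumerate(sentences)]
         for i, si in enumerate(sentences)]
    row_sums = [sum(row) or 1.0 for row in w]  # avoid division by zero
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(w[j][i] / row_sums[j] * scores[j]
                                  for j in range(n))
                  for i in range(n)]
    return scores

# Toy per-frame captions: the noisy outlier should rank lowest.
captions = [
    "a man is riding a bike",
    "a man rides a bicycle down the road",
    "a dog barks",
    "a man is riding a bicycle",
]
scores = rank_sentences(captions)
best = captions[max(range(len(captions)), key=scores.__getitem__)]
```

Sentences that agree with many other candidates accumulate score, so the noisy "a dog barks" caption is pushed to the bottom, which mirrors the noise-reduction role of the summarization step.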
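The second framework's key idea, a single recurrent model that first reads a variable-length frame sequence and then emits a variable-length word sequence from the same state, can be sketched as a two-stage unrolling schedule. The `step` function below is a stand-in for a real LSTM cell, and the toy decoder and feature values are purely illustrative assumptions, not the thesis's model.

```python
# Sketch of the sequence-to-sequence schedule: encode frames first, then
# switch the same recurrence to decoding words, so variable-length inputs
# and outputs share one hidden state. `step` is a placeholder for an LSTM
# cell; `decode` stands in for the word-prediction head.

def step(state, inp):
    """Placeholder recurrent cell: fold the input features into the state."""
    return state + sum(inp)

def caption(frames, decode, max_words=10, eos="<eos>"):
    # Encoding stage: consume per-frame features; no words are emitted yet.
    state = 0.0
    for f in frames:
        state = step(state, f)
    # Decoding stage: emit words conditioned on the state until <eos>.
    words = []
    for _ in range(max_words):
        word, state = decode(state)
        if word == eos:
            break
        words.append(word)
    return words

def make_toy_decoder(sentence):
    """Deterministic stand-in decoder that replays a fixed sentence."""
    it = iter(sentence + ["<eos>"])
    def decode(state):
        return next(it), state
    return decode

frames = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2]]   # toy per-frame CNN features
decoder = make_toy_decoder(["a", "man", "rides", "a", "bike"])
print(caption(frames, decoder))   # ['a', 'man', 'rides', 'a', 'bike']
```

Because both stages unroll the same recurrence, neither the number of frames nor the number of words needs to be fixed in advance, which is exactly the "complex dynamics" the second framework targets.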
Keywords/Search Tags:Video Captioning, Deep Learning, RNN, CNN