
Generating Descriptions For Videos With Attention Model

Posted on: 2019-12-04  Degree: Master  Type: Thesis
Country: China  Candidate: Z W Yang  Full Text: PDF
GTID: 2428330593451030  Subject: Computer Science and Technology
Abstract/Summary:
Exploring the semantic correlation between visual content and natural language is a crucial challenge in multimedia content analysis and computer vision. Recently, the development of deep learning has provided strong technical support for progress on this problem. As consecutive visual content, videos carry rich information, and their temporal and spatial structures are key to video content understanding. Current deep-learning-based video captioning methods design different deep networks to encode video frames and to incorporate their temporal-spatial structure. In contrast to previous work, this thesis focuses on applying attention models to the video captioning task. It introduces two video captioning methods that automatically attend to important video frame regions or video segments while generating descriptions.

The first method considers salient video segments. It introduces an attention model into the language model for video captioning, adaptively selecting salient video segments for each word prediction. The method is evaluated on the popular MSVD benchmark, and the experiments demonstrate that introducing temporal attention at the sentence-generation stage improves video captioning performance.

The second method considers regions of interest within individual video frames and the temporal dependencies between these regions. It uses an attention model, guided by a global feature, to select regions of interest in each video frame. In addition, the thesis designs a dual-memory recurrent model that incorporates the temporal dependencies of the global features and the region-of-interest features respectively, yielding a more discriminative video representation. This second method is also evaluated on the MSVD and M-VAD datasets and achieves strong performance.
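As a rough illustration of the first method's core idea (not the thesis's exact formulation), soft temporal attention scores each video-segment feature against the language model's current hidden state, normalizes the scores with a softmax, and feeds the weighted context vector into the next word prediction. The function names and the dot-product scoring rule below are illustrative assumptions; the thesis may use a different (e.g. learned additive) scoring function.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(segment_feats, hidden):
    """Soft temporal attention (illustrative sketch).

    segment_feats: list of per-segment feature vectors (lists of floats).
    hidden: the decoder's current hidden state vector.
    Returns the attention-weighted context vector and the weights.
    """
    # Score each segment by its dot product with the hidden state
    # (an assumed scoring rule; learned MLP scorers are also common).
    scores = [sum(f_i * h_i for f_i, h_i in zip(f, hidden))
              for f in segment_feats]
    weights = softmax(scores)
    # Weighted sum of segment features gives the context vector.
    dim = len(segment_feats[0])
    context = [sum(w * f[d] for w, f in zip(weights, segment_feats))
               for d in range(dim)]
    return context, weights

# Usage: the segment aligned with the hidden state receives more weight.
ctx, w = temporal_attention([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

At each decoding step the context vector is recomputed with the new hidden state, which is what lets the model "adaptively select" different salient segments for different words.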
Keywords/Search Tags: Video Captioning, Deep Learning, Attention Model, RNN