
Video Summarization And Captioning Via Spatio-temporal Information And Deep Learning

Posted on: 2018-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z Guo
Full Text: PDF
GTID: 2348330512488927
Subject: Computer software and theory
Abstract/Summary:
With the widespread availability of video cameras and smartphones, the amount of video content on the web is growing explosively. This massive, disorganized body of web video degrades the user experience: viewing the relevant videos and grasping the gist of their content is time-consuming and tedious, so efficient and user-friendly methods for organizing, managing, and browsing this enormous volume of video data are urgently needed. Here, we introduce a video summarization and captioning framework that responds to this need. Our framework consists of two components: 1) a deep Convolutional Neural Network (CNN) encoder and 2) an attention-based hierarchical Long Short-Term Memory (LSTM) decoder. In the encoding stage, we first extract keyframes to represent the whole video and then apply a CNN architecture to each keyframe to extract representative frame-level features (see the encoder sketch below). In the decoding stage, we use an LSTM, which avoids the vanishing-gradient problem of plain RNN models; conditioned on the feature vectors produced by the encoder, the LSTM generates a semantic sentence.

The objective of video summarization is to produce a condensed and informative version of a given video as a set of interesting and representative frames or segments. Depending on the specific application, video summarization can be formulated either as keyframe extraction, which selects a collection of salient images from the underlying video source, or as video skim generation, which assembles a collection of video segments from the original video. A good video summary must meet at least two objectives. First, it should contain the most interesting parts of the original video; in a jump video, for example, one does not want to miss highlights such as the start, the jump in the air, and the landing. Second, the summary should be representative, preserving the diversity of the original while removing redundancy.

To accomplish this, we exploit a video's salience cues, motion cues, and a selection model to capture a stable salience weight, a discriminative weight, and a representative weight for each video shot, respectively. We then combine these weights in a unified framework to predict the importance score of each shot, and the most important shots are selected for the storyboard. Critically, the approach is neither camera-wearer-specific nor object-specific: the learned importance metric need not be trained for a given user or context, and it can predict the importance of shots that have never been seen before.
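As a rough illustration of the shot-selection step just described, the sketch below fuses the three per-shot weights into a single importance score and keeps the top-scoring shots for the storyboard. The linear combination, its coefficients, and the function names are assumptions for illustration; the abstract states only that the weights are combined in a unified framework.

```python
import numpy as np

def importance_scores(salience_w, discriminative_w, representative_w,
                      alpha=0.4, beta=0.3, gamma=0.3):
    """Fuse the three per-shot weights into one importance score.
    The linear form and the coefficients are illustrative assumptions."""
    return alpha * salience_w + beta * discriminative_w + gamma * representative_w

def select_storyboard(shots, scores, budget=5):
    """Keep the `budget` highest-scoring shots, restored to temporal order."""
    top = np.argsort(scores)[::-1][:budget]
    return [shots[i] for i in sorted(top)]

# Toy usage: random weights standing in for salience, motion, and selection cues.
rng = np.random.default_rng(0)
shots = list(range(20))
scores = importance_scores(rng.random(20), rng.random(20), rng.random(20))
print(select_storyboard(shots, scores))
```

Because the score is a function of shot-level cues rather than of a particular user or object, the same learned metric can score shots it has never seen, matching the claim above.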
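Before turning to captioning, here is the encoder sketch referenced earlier: uniform keyframe sampling followed by per-frame CNN features. The ResNet-50 backbone, the uniform sampling rule, and the frame counts are assumptions; the abstract names neither a specific CNN architecture nor a keyframe-selection criterion.

```python
import torch
import torchvision.models as models

def extract_keyframes(video_frames, num_keyframes=28):
    """Uniformly sample keyframes from a (T, 3, H, W) frame tensor.
    A real system would use a content-based keyframe criterion instead."""
    t = video_frames.shape[0]
    idx = torch.linspace(0, t - 1, num_keyframes).long()
    return video_frames[idx]

# Assumed backbone: ResNet-50 with the classification head removed,
# leaving a 2048-d pooled feature vector per keyframe.
backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()
backbone.eval()

video = torch.randn(300, 3, 224, 224)            # dummy decoded video
with torch.no_grad():
    feats = backbone(extract_keyframes(video))   # (28, 2048) frame features
print(feats.shape)
```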
Recent progress has been made in using attention-based encoder-decoder frameworks for video captioning, but several problems remain. 1) Most existing decoders apply the attention mechanism to every generated word, ignoring the relationship between the video content and the semantics of the entire sentence. To address this issue, we propose a novel Attention-based LSTM with Semantic Consistency (aLSTMs) approach for video captioning. 2) Most existing decoders also apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). However, non-visual words can easily be predicted by a natural language model without considering visual signals or attention, and imposing the attention mechanism on them can mislead the decoder and degrade the overall captioning performance. To address this issue, we propose a hierarchical LSTM with adjusted temporal attention (hLSTMat) approach for video captioning. Specifically, the proposed framework uses temporal attention to select the specific frames relevant to predicting the related words, while the adjusted temporal attention decides whether to rely on the visual information or on the language context information. In addition, a hierarchical LSTM is designed to consider both low-level visual information and high-level language context information simultaneously in support of caption generation. To demonstrate the effectiveness of the proposed framework, we test our method on two prevalent datasets, MSVD and MSR-VTT; experimental results show that our approach outperforms state-of-the-art methods on both.
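A minimal sketch of one decoding step under this scheme, with assumed layer sizes: temporal attention softmax-weights the encoder's frame features into a visual context vector, and a sigmoid gate decides how much that visual context, versus the language LSTM state, contributes to predicting the next word. The gate parameterization and the dimensions are assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjustedTemporalAttention(nn.Module):
    """Sketch of one hLSTMat-style decoding step. Layer sizes and the
    sigmoid-gate parameterization are illustrative assumptions; the abstract
    states only that temporal attention selects frames and an adjusted
    attention decides between visual and language context."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.att = nn.Linear(feat_dim + hidden_dim, 1)   # temporal attention scores
        self.gate = nn.Linear(hidden_dim, 1)             # visual-vs-language gate
        self.visual_proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, frame_feats, h_lang):
        # frame_feats: (B, T, feat_dim); h_lang: (B, hidden_dim) language LSTM state
        T = frame_feats.size(1)
        h_rep = h_lang.unsqueeze(1).expand(-1, T, -1)
        scores = self.att(torch.cat([frame_feats, h_rep], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                 # which frames to attend to
        visual_ctx = torch.einsum('bt,btd->bd', alpha, frame_feats)
        beta = torch.sigmoid(self.gate(h_lang))          # 1 = trust the visual signal
        return beta * self.visual_proj(visual_ctx) + (1 - beta) * h_lang

# Toy usage: 28 keyframe features from the encoder, batch of 2.
layer = AdjustedTemporalAttention()
ctx = layer(torch.randn(2, 28, 2048), torch.randn(2, 512))
print(ctx.shape)   # torch.Size([2, 512])
```

For a non-visual word such as "the", the gate can push beta toward zero so the prediction rests on the language state alone, which is exactly the failure mode of always-on attention that hLSTMat is meant to avoid.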
Keywords/Search Tags: Deep Learning, Attention Mechanism, Adjusted Temporal Attention, Video Summarization, Video Captioning