Video Highlight Detection Via Deep Learning

Posted on: 2019-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Jiao
Full Text: PDF
GTID: 2428330566474035
Subject: Engineering
Abstract/Summary:
The emergence of wearable devices such as portable cameras and smart glasses makes it possible to record life-logging first-person videos. However, people upload raw videos captured at various times and places for different purposes, so most videos range in length from a few minutes to a few hours and are full of noise. Browsing such long, unstructured videos is time-consuming and tedious. Video highlight detection, whose goal is to localize key elements (moments or short clips of major or special interest) in a video, has therefore become increasingly important for alleviating this burden.

Most existing highlight detection approaches extract features from a video segment as a whole, without considering differences among local features in either the temporal or the spatial dimension. Because video content is complex, such mixed features degrade the final highlight prediction. Temporally, not all frames of a video are worth watching: some contain only the background of the environment, with no humans or other moving objects. Spatially, the situation is similar: not all regions in each frame are highlights, especially when the background is cluttered. To address these problems, the contributions of this thesis are as follows.

(1) We propose a novel deep ranking model based on local regions, which selects key elements in each frame spatially. The model produces position-sensitive score maps for each region and aggregates the position-wise scores with a Gaussian position-pooling operation. Regions with higher response values are then extracted, which yields a better score for predicting video highlights because the key information within each local region is taken into account. The position-sensitive scheme integrates easily into an end-to-end fully convolutional network whose parameters are updated by stochastic gradient descent during backpropagation, improving the robustness of the model.

(2) We propose a novel 3D (spatial and temporal) attention model that automatically localizes the key elements in a video without any extra supervised annotations. Specifically, the attention model produces attention weights for local regions along both the spatial and temporal dimensions of a video segment. Regions containing key elements are strengthened with large weights, so a more effective feature of the video segment is obtained for predicting the highlight score. The 3D attention scheme integrates easily into a conventional end-to-end deep ranking model, which learns a deep neural network to compute a highlight score for each video segment.

(3) Extensive experimental results on the YouTube and SumMe datasets demonstrate that the proposed approaches achieve significant improvements over state-of-the-art methods. In particular, with the proposed 3D attention model, video highlights can be accurately retrieved in the spatial and temporal dimensions, without human supervision, in several domains.
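Two ingredients of contribution (1) can be sketched in a few lines: Gaussian position-pooling of a position-sensitive score map, and the pairwise hinge loss that deep ranking models typically optimize. This is a minimal pure-Python illustration; the function names, the toy 2-D score-map layout, and the sigma and margin values are assumptions for illustration, not the thesis's actual architecture or settings.

```python
import math

def gaussian_position_pooling(score_map, sigma=1.0):
    """Aggregate a position-sensitive score map into one region score,
    weighting each position by a Gaussian centred on the region centre,
    so central (key) positions contribute more than the border."""
    h, w = len(score_map), len(score_map[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    num, den = 0.0, 0.0
    for y in range(h):
        for x in range(w):
            wgt = math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
            num += wgt * score_map[y][x]
            den += wgt
    return num / den  # weighted average of position-wise scores

def pairwise_ranking_loss(highlight_score, non_highlight_score, margin=1.0):
    """Hinge loss that pushes a highlight segment's score above a
    non-highlight segment's score by at least `margin`."""
    return max(0.0, margin - (highlight_score - non_highlight_score))
```

With a 3x3 map that is 1 at the centre and 0 elsewhere, the pooled score exceeds the plain mean (1/9) because the Gaussian favours the centre; the loss is zero once the score gap exceeds the margin.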
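The 3D attention of contribution (2) can be sketched as a softmax over per-region relevance scores followed by a weighted sum of region features. This is a simplified, hypothetical pure-Python illustration: in the thesis the regions span both spatial and temporal dimensions of a segment and the relevance scores come from a learned sub-network, whereas here they are plain input lists.

```python
import math

def attention_weights(relevance):
    """Softmax over per-region relevance scores (one score per
    spatio-temporal region), shifted by the max for numerical stability."""
    m = max(relevance)
    exps = [math.exp(r - m) for r in relevance]
    total = sum(exps)
    return [e / total for e in exps]

def attended_feature(features, relevance):
    """Weighted sum of region feature vectors: regions with larger
    attention weights dominate the segment representation that is
    fed to the highlight-score predictor."""
    w = attention_weights(relevance)
    dim = len(features[0])
    return [sum(w[i] * features[i][d] for i in range(len(features)))
            for d in range(dim)]
```

Equal relevance scores give a uniform average of the region features; raising one region's relevance shifts the segment feature toward that region, which is how key elements are strengthened.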
Keywords/Search Tags:video highlight detection, deep ranking, attention model