
Video Summarization Via Semantic Attended Networks

Posted on: 2020-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: H W Wei
Full Text: PDF
GTID: 2428330620460031
Subject: Information and Communication Engineering

Abstract/Summary:
With continual upgrades in storage hardware and ever-faster data transmission on the Internet, recording video has become cheap and fast. Storing and browsing these large volumes of video data effectively is the central problem addressed by video summarization. Video summarization condenses a video into a short summary of its main content: representative frames and segments are extracted and assembled into a short video that can be browsed quickly, reducing both storage space and browsing time. Research in this area has attracted broad interest, and many strong algorithms have been proposed.

Redundancy in video generally takes two forms: visual redundancy, i.e. repeated pictures, and semantic redundancy, i.e. segments unrelated to the video's main theme. Earlier methods typically reduce visual redundancy by maximizing the visual diversity of the summary, but the summaries they produce are not semantically compact and are unfriendly to browse. To address semantic redundancy, this thesis proposes a video summarization method based on an attention mechanism and a video description (captioning) network. It designs a frame-selector module with an embedded attention mechanism and an encoder-decoder video description network. The frame selector is a single-layer LSTM that outputs an importance score for each frame according to its visual content; the score is multiplied with the frame's original CNN feature before being fed into the description network. The description network consists of an encoder built from a bidirectional LSTM and a decoder built from a single-layer LSTM, and word-embedding techniques are used to map the text information. Under the supervision of a given description text, the frame selector computes each frame's importance score from the semantic similarity between the video content and the supervisory signal: the higher the score, the more semantically representative the frame. The network can thus automatically locate video clips that are consistent with the semantics of the description text.

On this basis, to improve the sparseness of the generated summaries, two constraints are proposed: an unsupervised L1 sparsity constraint, which makes the network's output more compact, and a supervised constraint, which embeds the intrinsic statistical regularities of the video into the network's learning process and allows its performance to substantially exceed prior state-of-the-art methods.

To analyze the semantic information of video at a finer granularity, this thesis also proposes a video summarization method based on semantic graph clustering. A semantic graph over the video frames is built from the semantic similarity between the video and each word of the supervisory text; a graph clustering algorithm is then applied, and finally the highest-degree node in each sub-cluster is taken as the most semantically representative keyframe.

The proposed method based on the attention mechanism and the video description network achieves good results on several public datasets, which verifies its practicality and effectiveness.
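The frame-selector idea described above can be sketched in PyTorch. This is a minimal illustration, not the thesis's exact model: the layer sizes, the sigmoid scoring head, and all names (`FrameSelector`, `feat_dim`, `hidden_dim`) are assumptions. It shows the core mechanism only: a single-layer LSTM reads per-frame CNN features, emits a per-frame importance score, and the scores re-weight the features before they would enter the bidirectional-LSTM captioning encoder.

```python
import torch
import torch.nn as nn

class FrameSelector(nn.Module):
    """Single-layer LSTM that scores each frame's importance.

    A sketch of the selector described in the abstract; dimensions
    and the scoring head are illustrative assumptions.
    """
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=1, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) CNN features
        h, _ = self.lstm(feats)              # per-frame hidden states
        return torch.sigmoid(self.score(h))  # importance scores in (0, 1)

feats = torch.randn(2, 30, 1024)             # 2 clips, 30 frames each
selector = FrameSelector()
scores = selector(feats)                     # (2, 30, 1)
weighted = scores * feats                    # re-weighted features for the captioning encoder
print(tuple(scores.shape))                   # (2, 30, 1)
```

In this framing, the unsupervised L1 sparsity constraint from the abstract would simply be a `scores.abs().mean()` term added to the training loss, pushing most frame scores toward zero so the summary stays compact.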
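The graph-clustering keyframe selection can likewise be sketched. The thesis does not specify its clustering algorithm here, so this NumPy example uses connected components over a thresholded similarity graph as a stand-in; the function name, the threshold `tau`, and the toy similarity matrix are all assumptions. The final step matches the abstract: within each cluster, the node with the largest degree is kept as the keyframe.

```python
import numpy as np

def keyframes_by_graph_clustering(sim, tau=0.5):
    """Pick one keyframe per semantic cluster.

    sim: (T, T) frame-to-frame semantic similarity matrix.
    Edges are kept where similarity >= tau; clusters are the
    connected components of the resulting graph (a simple
    stand-in for the thesis's graph clustering); the highest-
    degree node in each cluster is returned as its keyframe.
    """
    T = sim.shape[0]
    adj = (sim >= tau) & ~np.eye(T, dtype=bool)  # undirected graph, no self-loops

    # connected components via depth-first search
    labels = -np.ones(T, dtype=int)
    comp = 0
    for start in range(T):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = comp
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if labels[v] == -1:
                    labels[v] = comp
                    stack.append(v)
        comp += 1

    degree = adj.sum(axis=1)
    keyframes = []
    for c in range(comp):
        members = np.flatnonzero(labels == c)
        keyframes.append(int(members[np.argmax(degree[members])]))
    return sorted(keyframes)

# toy example: frames 0-2 form one semantic cluster, frames 3-4 another
sim = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.7, 0.1, 0.1],
    [0.8, 0.7, 1.0, 0.2, 0.1],
    [0.1, 0.1, 0.2, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
])
print(keyframes_by_graph_clustering(sim))  # [0, 3]
```

In the thesis's setting, `sim` would be derived from the semantic similarity between frames and the words of the supervisory description text rather than hand-written as above.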
Keywords/Search Tags:recurrent neural networks, attention mechanism, encoder-decoder, video description, graph clustering