
Research On Semantic Guiding Video Captioning Methods With Attention Mechanism And Memory Network

Posted on: 2020-10-04
Degree: Master
Type: Thesis
Country: China
Candidate: J Yuan
GTID: 2428330602950201
Subject: Signal and Information Processing
Abstract/Summary:
The task of describing a video in natural language is called video captioning. It combines key technologies from natural language processing and computer vision, and research results in this area promote the development of cross-modal analysis. In recent years, more and more researchers have worked on video captioning. Generating a sentence for a video is a complex task: the model must not only identify the different objects in a video and the interactions between them, but also describe the video content in natural language. Currently, most video captioning methods follow a sequence learning approach, which first uses convolutional neural networks to extract features from a video and then uses recurrent neural networks to generate a sentence description from those visual features. The approach in this thesis is also based on sequence learning. Our main contributions are summarized as follows.

(1) We propose a video captioning method based on deep visual features and semantic attributes. Most existing video captioning methods use only the visual information of a video and ignore semantic information, which is very important for video description. This method therefore uses the visual information of a video and also exploits semantic information as guidance when generating the description. First, the method uses two kinds of convolutional networks to extract features from single frames and from successive frames of the video, respectively, and averages those features to obtain visual object features and motion features of the video. Then, three types of semantic attributes are obtained from the sentence descriptions in the training set, and a separate semantic attribute predictor is trained for each type. Finally, we propose a semantic guiding long short-term memory network, which uses the semantic attributes to guide the generation of the video description. Experiments on the MSVD dataset show improvements over state-of-the-art methods on many evaluation metrics.

(2) We propose a video captioning method that combines attention mechanisms and memory networks. To fully capture the object and motion information in a video, this method integrates an attention mechanism and a memory network into the semantic guiding long short-term memory network. First, the method uses the attention mechanism to selectively focus on the most significant visual content, so that the model attends to the most significant objects and actions at the current time step. Then, the method increases the memory capacity of the memory cells in the long short-term memory network by adding an external memory network, which interacts with the internal state of the long short-term memory network through read and write operations. Finally, the output features of the attention mechanism and the information read from the memory network are fed into the semantic guiding long short-term memory network to generate the video description. Extensive experiments on the MSVD dataset show that our method is superior to state-of-the-art methods.
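The two feature-aggregation ideas described in the abstract, averaging per-frame features and attending selectively to the most relevant frames, can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not state the attention scoring function, so dot-product scoring is an assumption here, and the names `mean_pool`, `attend`, `frames`, and `query` (standing in for a decoder state) are hypothetical.

```python
import math


def mean_pool(frame_features):
    """Average per-frame feature vectors into a single video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features, query):
    """Soft attention: weight each frame by its similarity to the query.

    Assumption: dot-product scoring. The thesis may use a different,
    learned scoring function.
    """
    scores = [sum(fd * qd for fd, qd in zip(f, query)) for f in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Toy data: three frames with 2-dimensional features (illustrative only).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = mean_pool(frames)                      # one vector for the whole video
weights, context = attend(frames, [2.0, 0.0])   # query is a stand-in decoder state
```

Mean pooling gives every frame equal weight, whereas the attention weights change with the query, so a decoder could emphasize different frames at each generation step.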
Keywords/Search Tags: video captioning, multi-feature representation, semantic attributes, attention mechanism, memory network