
Research On Video Description Based On Deep Neural Networks

Posted on: 2019-08-25 | Degree: Master | Type: Thesis
Country: China | Candidate: C Y Li | Full Text: PDF
GTID: 2428330566486041 | Subject: Circuits and Systems
Abstract/Summary:
Video description is a novel research topic at the intersection of computer vision and natural language processing, aiming to automatically generate descriptive sentences for a target video. Early work typically used visual detectors to capture the objects, scenes, and object interactions in a video, and then generated a descriptive sentence with a template-based language model. This approach depends heavily on the accuracy of the visual detectors, and template-generated descriptions can only state an event plainly, without expressing rich semantic information. Since deep learning achieved major breakthroughs in image classification, video recognition, and machine translation, video description based on deep neural networks has attracted more and more researchers. In recent years, the encoder-decoder framework has been widely applied to the video description task: deep neural networks such as convolutional neural networks and recurrent neural networks are used for visual feature encoding and sentence decoding respectively, and the best descriptive sentence is then selected as the final output by a beam search algorithm.

This thesis focuses on video description methods based on deep neural networks. First, we review the relevant theory of deep neural networks and the core technical issues of video description. Second, we study in depth the temporal-attention-based video description method proposed by Yao et al., and design three groups of comparison experiments to explore how different learning rates, batch sizes, and beam widths affect the caption generation model. Building on the method of Yao et al., we make a series of improvements and propose a video description method that combines rich semantic information with a temporal-spatial attention mechanism. The method involves four improvements:
1. Integration of scene information and optical flow features, which respectively represent the location information and the motion information of the video content.
2. A bi-directional LSTM encoder that generates high-level semantic representations by learning the contextual information of the visual features from both past and future directions.
3. A temporal-spatial attention mechanism that allows the language model, while generating the current word, to dynamically focus on the key features within different frames and on different parts of the video.
4. A beam search algorithm with length normalization.
Experiments on the MSVD and MSR-VTT video datasets show that the proposed approach outperforms the temporal-attention-based video description method on several commonly used evaluation metrics, and achieves scores comparable to other mainstream methods.
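As a rough illustration of the bi-directional encoding idea in the second improvement, the sketch below encodes per-frame visual features in both temporal directions with a bi-directional LSTM, so that each frame's representation reflects both past and future context. The class name, dimensions, and the use of torch.nn.LSTM are illustrative assumptions and are not taken from the thesis itself.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encodes a sequence of per-frame visual features into context-aware
    representations by reading the sequence forward and backward."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, n_frames, feat_dim)
        # outputs: (batch, n_frames, 2 * hidden_dim), the forward and backward
        # hidden states concatenated, so each frame sees past and future context
        outputs, _ = self.lstm(frame_feats)
        return outputs
```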
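The temporal side of the attention mechanism in the third improvement can be sketched as standard soft attention over frame features, in the spirit of Yao et al.'s temporal attention referenced in the abstract: the decoder's hidden state scores each frame, and a weighted sum of frame features forms the context vector for the current word. All module names and shapes here are hypothetical, a minimal sketch rather than the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.attn_w = nn.Linear(hidden_dim, attn_dim)  # projects decoder state
        self.attn_u = nn.Linear(feat_dim, attn_dim)    # projects frame features
        self.attn_v = nn.Linear(attn_dim, 1)           # scores each frame

    def forward(self, frame_feats, dec_hidden):
        # frame_feats: (batch, n_frames, feat_dim); dec_hidden: (batch, hidden_dim)
        scores = self.attn_v(torch.tanh(
            self.attn_u(frame_feats) + self.attn_w(dec_hidden).unsqueeze(1)
        )).squeeze(-1)                                   # (batch, n_frames)
        alpha = F.softmax(scores, dim=-1)                # weights over frames
        context = (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)  # weighted sum
        return context, alpha
```

A spatial counterpart would apply the same scoring over regions within each frame; only the temporal part is sketched here.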
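Length normalization in beam search, the fourth improvement, is commonly implemented by dividing a candidate sentence's summed log-probability by a length penalty so that short sentences are not systematically favored. The sketch below uses the GNMT-style penalty with alpha = 0.7 as an assumed setting; the thesis's exact normalization formula may differ.

```python
def length_normalized_score(log_probs, alpha=0.7):
    """Score a candidate caption from its per-word log-probabilities.

    Dividing the summed log-probability by a length penalty keeps beam
    search from preferring short captions by default.
    """
    length_penalty = ((5.0 + len(log_probs)) / 6.0) ** alpha
    return sum(log_probs) / length_penalty
```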
Keywords/Search Tags: Video description, Deep neural networks, Rich semantic information, Temporal-spatial attention mechanism, Length normalization