
Research On Video Description Based On Deep Neural Networks

Posted on: 2019-08-25 | Degree: Master | Type: Thesis
Country: China | Candidate: C Y Li | Full Text: PDF
GTID: 2428330566486041 | Subject: Circuits and Systems
Abstract/Summary:
Video description is a novel research topic at the intersection of computer vision and natural language processing, aiming to automatically generate descriptive sentences for a target video. Early work typically used visual detectors to capture the objects, scenes, and object interactions in a video, and then generated a descriptive sentence with a template-based language model. This approach depends heavily on the accuracy of the visual detectors, and template-generated descriptions can only state an event plainly, without expressing rich semantic information. Since deep learning achieved major breakthroughs in image classification, video recognition, and machine translation, video description based on deep neural networks has attracted more and more researchers. In recent years, the encoder-decoder framework has been widely applied to the video description task: deep neural networks such as convolutional neural networks and recurrent neural networks are used for visual feature encoding and sentence decoding respectively, and the best descriptive sentence is then selected as the final output by a beam search algorithm.

This thesis focuses on video description methods based on deep neural networks. First, we review the relevant theory of deep neural networks and the core technical issues of video description. Second, we study in depth the temporal-attention-based video description method proposed by Yao et al., and design three groups of comparison experiments to explore how different learning rates, batch sizes, and beam widths affect the caption generation model. Building on the method of Yao et al., we make a series of improvements and propose a video description method that combines rich semantic information with a temporal-spatial attention mechanism. The method involves four improvements:
1. Integration of scene information and optical flow features, which respectively represent the location information and the motion information of the video content.
2. A bi-directional LSTM encoder that generates high-level semantic representations by learning the contextual information of the visual features from both past and future directions.
3. A temporal-spatial attention mechanism that allows the language model, while generating the current word, to dynamically focus on the key features within different frames and on different parts of the video.
4. A beam search algorithm with length normalization.
Experiments on the MSVD and MSR-VTT video datasets show that the proposed approach outperforms the temporal-attention-based video description method on several commonly used evaluation metrics, and achieves scores comparable to other mainstream methods.
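As a rough illustration of the bi-directional encoding idea in the second improvement, the sketch below encodes per-frame visual features in both temporal directions with a bi-directional LSTM, so that each frame's representation reflects both past and future context. The class name, dimensions, and the use of torch.nn.LSTM are illustrative assumptions and are not taken from the thesis itself.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encodes a sequence of per-frame visual features into context-aware
    representations by reading the sequence forward and backward."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, n_frames, feat_dim)
        # outputs: (batch, n_frames, 2 * hidden_dim), the forward and backward
        # hidden states concatenated, so each frame sees past and future context
        outputs, _ = self.lstm(frame_feats)
        return outputs
```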
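The temporal side of the attention mechanism in the third improvement can be sketched as standard soft attention over frame features, in the spirit of Yao et al.'s temporal attention referenced in the abstract: the decoder's hidden state scores each frame, and a weighted sum of frame features forms the context vector for the current word. All module names and shapes here are hypothetical, a minimal sketch rather than the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.attn_w = nn.Linear(hidden_dim, attn_dim)  # projects decoder state
        self.attn_u = nn.Linear(feat_dim, attn_dim)    # projects frame features
        self.attn_v = nn.Linear(attn_dim, 1)           # scores each frame

    def forward(self, frame_feats, dec_hidden):
        # frame_feats: (batch, n_frames, feat_dim); dec_hidden: (batch, hidden_dim)
        scores = self.attn_v(torch.tanh(
            self.attn_u(frame_feats) + self.attn_w(dec_hidden).unsqueeze(1)
        )).squeeze(-1)                                   # (batch, n_frames)
        alpha = F.softmax(scores, dim=-1)                # weights over frames
        context = (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)  # weighted sum
        return context, alpha
```

A spatial counterpart would apply the same scoring over regions within each frame; only the temporal part is sketched here.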
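Length normalization in beam search, the fourth improvement, is commonly implemented by dividing a candidate sentence's summed log-probability by a length penalty so that short sentences are not systematically favored. The sketch below uses the GNMT-style penalty with alpha = 0.7 as an assumed setting; the thesis's exact normalization formula may differ.

```python
def length_normalized_score(log_probs, alpha=0.7):
    """Score a candidate caption from its per-word log-probabilities.

    Dividing the summed log-probability by a length penalty keeps beam
    search from preferring short captions by default.
    """
    length_penalty = ((5.0 + len(log_probs)) / 6.0) ** alpha
    return sum(log_probs) / length_penalty
```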
Keywords/Search Tags: Video description, Deep neural networks, Rich semantic information, Temporal-spatial attention mechanism, Length normalization