
Research On Automatic Video Description Based On Deep Reinforcement Learning

Posted on: 2019-05-21 | Degree: Master | Type: Thesis
Country: China | Candidate: W P Dong | Full Text: PDF
GTID: 2348330569495562 | Subject: Engineering
Abstract/Summary:
Video, as one of the most common forms of multimedia on the Internet, is a content carrier through which people transmit information and share their lives. Because video delivers dynamic and richer content than static images, research on video content analysis has gradually become a hot topic in computer vision. The automatic video description task is to describe video content in natural language; it involves video image analysis, natural language processing, and other related technologies. Automatic video description serves as a bridge between video content and plain text and has wide practical uses: it can, for example, provide prior information for applications such as video search and video classification, and, combined with speech synthesis, it can help visually impaired people understand video content more easily.

Unlike image description, video description must dynamically gather information from multiple frames to generate a natural-language description. This requires correctly identifying not only the objects in the video but also their dynamic behavior. However, the large amount of video data on the Internet must be manually annotated, which adds considerable effort to video description research, and existing methods cannot effectively model the dynamic characteristics of video. Automatic video description therefore still faces great challenges in practical applications.

This paper studies the video description models proposed in recent years and analyzes their shortcomings. On this basis, we propose a novel region-attention video description model based on policy learning. The model consists of two parts, a location policy network and a region attention network, and it works as follows (illustrative sketches of the two key components are given after this abstract):
(1) Pre-trained convolutional neural networks such as VGGNet and C3DNet encode the video, extracting deep features as a preliminary representation of the video information.
(2) A policy gradient algorithm from reinforcement learning adaptively learns a location policy that selects multiple regions of each video frame and integrates them into an overall scene representation.
(3) To preserve the dynamic information of the video, attention mechanisms integrate the regional features of different frames along the temporal dimension to generate contextual features.
(4) The contextual features are decoded by an LSTM into text sentences.
Because region selection is non-differentiable, the location policy and the overall network parameters are updated with supervised learning combined with reinforcement learning. Finally, we evaluated our method on two large video benchmark datasets, MSVD and TACoS-MultiLevel. Measured by BLEU, METEOR, and CIDEr, our method outperforms other current state-of-the-art methods on both datasets.
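The following is a minimal PyTorch sketch of the location-policy idea in steps (1)-(2). All names here (LocationPolicy, region_feats, reinforce_loss, and so on) are illustrative assumptions rather than the thesis's actual code: region selection is sampled from a learned distribution and trained with the REINFORCE policy-gradient estimator, since the sampling step is non-differentiable.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationPolicy(nn.Module):
    """Scores candidate frame regions and samples k of them.

    Sampling is non-differentiable, so the policy is trained with the
    REINFORCE policy-gradient estimator rather than backpropagation alone.
    """

    def __init__(self, feat_dim, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, region_feats, k=4):
        # region_feats: (batch, num_regions, feat_dim), e.g. spatial cells of
        # a VGGNet/C3DNet feature map treated as candidate regions.
        logits = self.scorer(region_feats).squeeze(-1)          # (batch, num_regions)
        dist = torch.distributions.Categorical(F.softmax(logits, dim=-1))
        picks = dist.sample((k,)).transpose(0, 1)               # (batch, k), with replacement
        log_prob = dist.log_prob(picks.transpose(0, 1)).sum(0)  # (batch,)
        # Gather the selected region features and average them into one
        # overall scene representation per frame.
        idx = picks.unsqueeze(-1).expand(-1, -1, region_feats.size(-1))
        scene = region_feats.gather(1, idx).mean(dim=1)         # (batch, feat_dim)
        return scene, log_prob

def reinforce_loss(log_prob, reward, baseline=0.0):
    # REINFORCE: maximize E[reward] by minimizing -(reward - baseline) * log_prob.
    # The abstract does not specify the reward; a sentence-level metric such as
    # CIDEr would be one plausible choice (an assumption, not the thesis's method).
    return -((reward - baseline) * log_prob).mean()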
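A companion sketch, under the same assumptions, of steps (3)-(4): soft attention over the per-frame scene features along the temporal dimension, feeding an LSTM that decodes the description one word at a time. In training, the cross-entropy loss on these logits would be combined with the REINFORCE term above, mirroring the supervised-plus-reinforcement update described in the abstract.

class AttentionDecoder(nn.Module):
    def __init__(self, feat_dim, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim) -- one scene vector per frame,
        #              e.g. the output of LocationPolicy applied to each frame.
        # captions:    (batch, seq_len) word indices, teacher-forced during training.
        B, T, D = frame_feats.shape
        h = frame_feats.new_zeros(B, self.lstm.hidden_size)
        c = frame_feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(captions.size(1) - 1):
            # Attention weights over frames, conditioned on the decoder state,
            # integrate regional features along the temporal dimension.
            query = h.unsqueeze(1).expand(B, T, -1)
            scores = self.attn(torch.cat([query, frame_feats], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)                      # (B, T)
            context = (alpha.unsqueeze(-1) * frame_feats).sum(1)   # (B, D)
            # One LSTM step on [previous word embedding; contextual feature].
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, seq_len - 1, vocab_size)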
Keywords/Search Tags:video description, deep learning, reinforcement learning, location policy, attention mechanism