
Video Captioning With Adversarial Reinforcement Learning

Posted on: 2021-02-15
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Yin
Full Text: PDF
GTID: 2428330602983770
Subject: Computer Science and Technology
Abstract/Summary:
In modern society, thanks to technological progress, large amounts of video data are uploaded from mobile devices and social media all the time. Faced with such a huge volume of video data, no individual or group can fully browse the content. How to process and understand this content has become a pressing problem, so video captioning has attracted the attention of experts and scholars.

Two major challenges remain in the video captioning task: 1) To understand the content of a video, the computer must grasp both the information contained in each frame and the temporal relationships between frames. A video carries rich and complex information, such as multiple objects, actions, and interactions; in addition, the inherent temporal dynamics of video reflect the interactions between subjects and their movement trajectories. Getting the computer to fully understand the information presented in the video is therefore essential. 2) The sentences generated by the computer should be consistent with those a person would write; that is, the generated sentences should be semantically correct and linguistically natural. Semantic correctness, meaning close agreement with the video content, is the most basic requirement of the task. Naturalness satisfies human habits of expression, so that the generated descriptions are more readable.

To address these issues, this thesis presents a new video captioning method based on adversarial reinforcement training, named VICTOR. Specifically, a generator, a discriminator, and a training strategy are designed. The generator produces captions and consists of an encoder, a decoder, and a reconstructor. The encoder uses a CNN+RNN architecture to extract abstract features of the video. The decoder exploits both local and global textual information to generate a comprehensive description. In addition, to strengthen the correlation between the generated sentence and the original video, we additionally reconstruct the video feature sequence from the output of the decoder. The purpose of the discriminator is to determine whether a caption comes from synthetic or annotated data, and to score each word in the generated caption. In terms of training, the adversarial reinforcement policy enhances the smoothness and naturalness of the results.

The main contributions of this thesis are summarized as follows:
·This thesis proposes a video captioning method with an adversarial reinforcement strategy, dubbed VICTOR. It can mine the rich information hidden in a video and translate the video into natural language. In addition, the specially designed training strategy meets both the correctness and naturalness requirements.
·A special decoder is designed to make full use of the local and global information of the text. The decoder consists of two layers: the bottom layer focuses on language modeling at the local word level; the top layer focuses on language modeling at the video-content and global-sequence level.
·Experimental results on common benchmark datasets, namely MSVD, MSR-VTT, and Charades, show that this method performs well on the video captioning task and outperforms several other video captioning methods.
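To make the two-layer decoder idea concrete, the following is a minimal NumPy sketch, not the thesis's actual implementation: the bottom recurrent layer models the caption at the local word level, while the top layer additionally conditions on a video feature vector, standing in for the video-content and global-sequence level. All dimensions, parameter names, and the simple tanh cells are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

D_W, D_V, D_H, V = 8, 10, 16, 20  # word-emb, video-feat, hidden, vocab sizes

# Randomly initialised toy parameters (a real model would learn these).
Wx1 = rng.normal(scale=0.1, size=(D_W, D_H))
Wh1 = rng.normal(scale=0.1, size=(D_H, D_H))
Wx2 = rng.normal(scale=0.1, size=(D_H + D_V, D_H))
Wh2 = rng.normal(scale=0.1, size=(D_H, D_H))
Wout = rng.normal(scale=0.1, size=(D_H, V))

def two_layer_decode(word_embs, video_feat):
    """Bottom layer: local word-level language modeling.
    Top layer: also conditions on the video feature (global content)."""
    h1 = np.zeros(D_H)
    h2 = np.zeros(D_H)
    logits = []
    for x in word_embs:
        h1 = np.tanh(x @ Wx1 + h1 @ Wh1)          # local word level
        top_in = np.concatenate([h1, video_feat])  # inject video content
        h2 = np.tanh(top_in @ Wx2 + h2 @ Wh2)      # global sequence level
        logits.append(h2 @ Wout)                   # per-step vocab scores
    return np.stack(logits)

T = 5  # caption length
out = two_layer_decode(rng.normal(size=(T, D_W)), rng.normal(size=D_V))
print(out.shape)  # (5, 20): one vocabulary logit vector per time step
```

Feeding the video feature only into the top layer keeps the bottom layer a pure language model, which mirrors the local/global split the decoder description draws.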
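As a rough illustration of how the discriminator's per-word scores can drive the generator, here is a REINFORCE-style policy-gradient loss in NumPy. This is a generic sketch of the technique, not the thesis's exact formulation; the function name, toy numbers, and constant baseline are all our own assumptions.

```python
import numpy as np

def policy_gradient_loss(log_probs, word_rewards, baseline=0.0):
    """REINFORCE-style loss: each sampled word's log-probability is weighted
    by the discriminator's per-word reward minus a baseline, so words the
    discriminator judges "real" are reinforced and the rest are suppressed."""
    advantages = np.asarray(word_rewards, dtype=float) - baseline
    return -np.sum(np.asarray(log_probs, dtype=float) * advantages)

# Toy example: three sampled words, their log-probabilities under the
# generator, and hypothetical per-word scores from the discriminator.
loss = policy_gradient_loss(
    log_probs=[-0.1, -0.5, -0.3],
    word_rewards=[0.9, 0.8, 0.7],  # hypothetical discriminator scores
    baseline=0.5,                  # simple constant baseline
)
print(round(loss, 2))  # 0.25
```

Because the reward is assigned per word rather than per sentence, the generator receives a denser training signal than sentence-level adversarial schemes provide.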
Keywords/Search Tags:Video captioning, Adversarial training, Reinforcement learning, Generative adversarial network