Research On Video Captioning Based On Deliberation Mechanism

Posted on: 2021-01-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Zhang
GTID: 2428330614471758
Subject: Electronic and communication engineering
Abstract/Summary:
Video captioning aims to have a computer automatically generate human-readable natural language that describes video content. In recent years it has been increasingly used in human-computer interaction, fast video retrieval, and assistance systems for the visually impaired. Most current video captioning algorithms are based on the encoder-decoder framework: an encoder first extracts visual information from the video sequence, and a decoder then generates the descriptive sentence word by word from the encoded visual features. Although such methods are widely used and perform well, one-pass decoding takes the generated sequence directly as the final output. It lacks an overall view of the generated sentence and any process of deliberation and refinement. The caption generation stage is also insufficiently optimized: it suffers from exposure bias and from a mismatch between the training objective and the evaluation metrics used at inference. To address these problems, this thesis studies video captioning algorithms. The main work is as follows:

(1) A global-aware deliberation network (De_ga) based on global information is proposed. It introduces the deliberation mechanism into the traditional encoder-decoder framework, so that when decoding the current word the model considers not only the preceding words of the caption but also the words that follow. The network has two decoder layers: the first-layer decoder generates a raw video caption, and the second-layer decoder polishes and refines that raw caption in a deliberative manner. Because the second-layer decoder has global information about the raw caption, it can use the future words in that caption to generate a semantically better one.

(2) A soft attention mechanism is added to the global-aware deliberation network, yielding a global-local-aware deliberation network (De_lga) based on joint global and local information. Through soft attention, the second-layer decoder can selectively focus on both the global information produced by the first-layer decoder and the most important local words, and so generate a better caption.

(3) Video captioning algorithms based on the classic encoder-decoder framework are trained with the traditional cross-entropy loss, which leads to exposure bias and to a mismatch between the training objective and the evaluation metrics used at inference. To address this, the thesis introduces a deep reinforcement learning algorithm based on Self-Critical Sequence Training (SCST) to improve the training of the two deliberation network models proposed above. It optimizes the evaluation metrics directly and further improves model performance.
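The three ideas in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the thesis implementation. The function names are hypothetical. Mean pooling as the "global information" and dot-product scoring for the soft attention are both assumptions made for brevity, since the abstract does not specify either. The SCST loss follows the standard self-critical form, in which the reward of the greedy caption serves as the baseline for a sampled caption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_summary(states):
    """Global information (assumed here to be mean pooling) over all
    first-pass decoder states, i.e. over past AND future words."""
    n = len(states)
    dim = len(states[0])
    return [sum(state[i] for state in states) / n for i in range(dim)]

def soft_attention(query, states):
    """Local information: dot-product soft attention (an assumption) of the
    second-pass decoder's query over the first-pass decoder states."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * state[i] for w, state in zip(weights, states))
               for i in range(dim)]
    return context, weights

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical loss for one caption:
    -(r(sampled) - r(greedy)) * sum of log-probabilities of the sampled words.
    The rewards would be a caption metric such as CIDEr (computed elsewhere)."""
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)

if __name__ == "__main__":
    # Three first-pass states for a toy 3-word raw caption (made-up numbers).
    first_pass = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    context, weights = soft_attention([2.0, 0.0], first_pass)
    print("global:", global_summary(first_pass))
    print("attention weights:", weights)
    print("local context:", context)
    print("scst loss:", scst_loss([-1.0, -2.0], 0.8, 0.5))
```

In a two-pass decoder of this kind, the global summary and the attended local context would both be fed to the second-layer decoder at every step. The SCST loss is positive when the sampled caption beats the greedy one, so minimizing it raises the probability of captions that score better than the model's own greedy output.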
Keywords/Search Tags:Video captioning, Deliberation mechanism, Attention mechanism, Reinforcement learning, Deep learning, Encoder-Decoder