Research On Video Captioning Method Based On Multi-Head Attention

Posted on: 2020-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: K Shi
GTID: 2428330623967000
Subject: Computer Science and Technology
Abstract/Summary:
Video captioning spans two areas, computer vision and natural language processing, and is a very challenging task. It also has a wide range of practical applications. Describing video content manually can certainly produce very accurate results, but as labor costs rise rapidly, the demand for automatic video annotation methods is becoming more and more urgent.

In recent years, many video captioning models have adopted the encoder-decoder framework widely used in natural language processing. The encoder uses the video feature sequence to generate an intermediate vector representation of the video, and the decoder then uses this intermediate representation to generate a text description, so that input and output are processed in a sequence-to-sequence manner. The encoder-decoder framework has greatly advanced video captioning research, but current video captioning models still have many shortcomings. First, many models lack the ability to focus on critical information. Second, the input data of the video captioning model differs between the training phase and the testing phase, which leads to the exposure bias problem. Finally, the optimization target of the captioning model during training is the word-level cross-entropy loss, which is inconsistent with the evaluation metrics computed at the n-gram level.

To solve these problems, this thesis proposes a video captioning model based on multi-head attention. The model introduces a multi-head attention mechanism into the traditional encoder-decoder network and optimizes both the training method and the training objective. The main research work is as follows:

1. A multi-head attention mechanism is introduced so that the model can focus on key information. During decoding, the multi-head attention mechanism supplies encoder information in addition to the decoder's own state, and assigns different weights to the encoding information of each step according to its relevance.

2. A step-by-step mixing training method is proposed to solve the exposure bias problem. The training process of the video captioning model is divided into several stages. Noisy data is used during training, and the probability of training on the model's own predicted sentences is gradually increased so that the training data approaches the test data.

3. A reinforcement learning method is used to solve the inconsistency between the training objective and the evaluation metrics. The original training objective of the model is to maximize the sum of the probabilities of generating the target sequences. With reinforcement learning, the model can be trained jointly on the evaluation metric score and the target sequence probability sum.

To verify the validity of the video captioning model, this thesis conducts experiments on the MSVD dataset and the MSR-VTT dataset. The experimental results show that the proposed model effectively improves video captioning performance.
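As an illustration of the first contribution, the following is a minimal, dependency-free sketch of one decoding step of multi-head attention over the encoder states. It assumes the common scaled dot-product formulation; the abstract does not specify the thesis's exact formulation, so the function names, the projection matrices `w_q`, `w_k`, `w_v`, `w_o`, and all dimensions here are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def multi_head_attention(query, enc_states, w_q, w_k, w_v, w_o, num_heads):
    """One decoding step: attend from the decoder state to all encoder steps.

    query:      decoder state, length d_model
    enc_states: T encoder states, each of length d_model
    Returns (context vector of length d_model, per-head weights of shape H x T).
    Assumption: scaled dot-product attention, one d_model/num_heads slice per head.
    """
    d_model = len(query)
    d_head = d_model // num_heads
    q = matvec(query, w_q)
    keys = [matvec(h, w_k) for h in enc_states]
    values = [matvec(h, w_v) for h in enc_states]

    context, all_weights = [], []
    for head in range(num_heads):
        lo, hi = head * d_head, (head + 1) * d_head
        # Relevance of each encoder step to the current decoder state.
        scores = [
            sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(d_head)
            for k in keys
        ]
        weights = softmax(scores)  # one weight per encoder step
        all_weights.append(weights)
        # Weighted sum of this head's slice of the value vectors.
        for j in range(lo, hi):
            context.append(sum(w * v[j] for w, v in zip(weights, values)))
    return matvec(context, w_o), all_weights


def rand_matrix(rows, cols):
    """Small random matrix standing in for learned parameters."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, d_model, num_heads = 6, 8, 2  # toy sizes, chosen only for illustration
enc_states = rand_matrix(T, d_model)
query = [random.gauss(0.0, 1.0) for _ in range(d_model)]
w_q, w_k, w_v, w_o = (rand_matrix(d_model, d_model) for _ in range(4))
context, weights = multi_head_attention(
    query, enc_states, w_q, w_k, w_v, w_o, num_heads
)
```

Each row of `weights` sums to one and assigns a different weight to every encoder step, which corresponds to the weighting by relevance described in item 1. A real model would use learned parameters and a tensor library rather than random lists.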
Keywords/Search Tags:Video captioning, Multi-head attention, Exposure bias, Step-by-step mixing training, Reinforcement learning
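
Two of the terms above, step-by-step mixing training and reinforcement learning, can be sketched in a few lines. This is a rough sketch under stated assumptions, not the thesis's procedure: the linear schedule, the `max_prob` cap, the sentence-level choice between ground truth and prediction, the baseline-subtracted policy-gradient term, and the `rl_weight` mixing coefficient are all hypothetical, since the abstract gives no formulas or hyperparameters.

```python
import random


def sampling_probability(stage, num_stages, max_prob=0.25):
    """Probability of training on the model's own predicted sentence.

    Assumption: it rises linearly from 0 in the first stage to max_prob
    in the last stage.
    """
    if num_stages <= 1:
        return max_prob
    return max_prob * stage / (num_stages - 1)


def choose_decoder_input(ground_truth, predicted, sample_prob, rng):
    """Pick the predicted sentence with probability sample_prob,
    otherwise the ground-truth sentence (sentence-level mixing, an assumption)."""
    return predicted if rng.random() < sample_prob else ground_truth


def mixed_loss(xent_loss, sampled_log_prob, sampled_reward, baseline_reward, rl_weight):
    """Joint objective: weighted sum of cross-entropy and a policy-gradient term.

    sampled_log_prob: log-probability of a caption sampled from the model
    sampled_reward:   evaluation-metric score of that sampled caption
    baseline_reward:  score of a reference caption used to reduce variance
                      (a hypothetical choice, not stated in the abstract)
    """
    rl_loss = -(sampled_reward - baseline_reward) * sampled_log_prob
    return rl_weight * rl_loss + (1.0 - rl_weight) * xent_loss


rng = random.Random(0)
truth = ["a", "man", "is", "cooking"]
guess = ["a", "man", "is", "eating"]
schedule = [sampling_probability(s, 4) for s in range(4)]
first_stage_input = choose_decoder_input(truth, guess, schedule[0], rng)
loss = mixed_loss(2.0, -1.5, 0.8, 0.5, rl_weight=0.5)
```

In the first stage the sampling probability is zero, so the decoder always sees the ground-truth sentence. Later stages feed the model's own predictions more often, which narrows the gap between training inputs and test inputs.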