Video Captioning Algorithms Based On Multi-head Attention Mechanism

Posted on: 2020-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: M Chen
GTID: 2428330572967282
Subject: Information and Communication Engineering
Abstract/Summary:
Given a video clip, the video captioning task is to automatically generate text that describes the video content. This thesis focuses on generating descriptions for short video clips. A short clip usually contains only one action or event, and the generated description is a single English sentence.

Mainstream video captioning models use a recurrent neural network to learn the temporal dependencies within the video feature sequence and the word sequence, and so obtain vector representations of the video and text content. Because of the structure of recurrent neural networks, such models cannot be computed in parallel, and the temporal dependencies they learn are not flexible enough. To speed up computation and learn better temporal dependencies, we propose a baseline video captioning model based on the multi-head attention mechanism. It can be computed in parallel and obtains better vector representations of the video and text content.

At the data input level, since video data contains information from multiple modalities, we propose a multi-modal feature fusion video captioning model built on the baseline model. It adaptively controls how the features of the different modalities are used to generate each word. The result is more natural description text that carries more detail about the video content and richer expression.

At the level of generalization and practicality, since existing video captioning datasets are small and cover a limited range of video types, we propose a video captioning model based on semi-supervised learning, again built on the baseline model. Pre-training on short video data yields a general-purpose denoising encoder for video frame features, and the pre-trained model is used to improve the performance of the baseline model on the video captioning task. A multi-task joint learning strategy is also introduced: the video frame feature denoising task regularizes the video captioning task, which further improves the generalization performance of the video frame feature encoder.
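
The abstract does not give the architecture of the baseline model, so the following is only a minimal sketch of the general multi-head self-attention operation applied to a sequence of video frame features. It is not the thesis's implementation. The function names, the dimensions, and the random matrices standing in for learned projection weights are all assumptions made for illustration. The sketch uses only the Python standard library so that it is self-contained; a real model would use a tensor library. The point it illustrates is that every frame attends to every other frame in one step, with no recurrence over time.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(frames, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with several heads.

    frames: T x d_model list of frame feature vectors (hypothetical input).
    w_q, w_k, w_v, w_o: d_model x d_model projection matrices.
    Returns the T x d_model output and the per-head T x T attention weights.
    """
    d_model = len(frames[0])
    d_head = d_model // num_heads
    q, k, v = matmul(frames, w_q), matmul(frames, w_k), matmul(frames, w_v)

    head_outputs, head_weights = [], []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        q_h = [row[lo:hi] for row in q]
        k_h = [row[lo:hi] for row in k]
        v_h = [row[lo:hi] for row in v]
        k_h_t = [list(col) for col in zip(*k_h)]  # transpose keys
        scores = matmul(q_h, k_h_t)               # T x T similarities
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v_h))
        head_weights.append(weights)

    # Concatenate the heads for each time step, then apply the output projection.
    merged = [sum((head[t] for head in head_outputs), []) for t in range(len(frames))]
    return matmul(merged, w_o), head_weights


def random_matrix(rows, cols, scale=0.1):
    """Random values standing in for learned parameters or extracted features."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, d_model, num_heads = 6, 8, 2           # assumed sizes for the demo
frames = random_matrix(T, d_model, scale=1.0)
w_q, w_k, w_v, w_o = (random_matrix(d_model, d_model) for _ in range(4))
encoded, attention = multi_head_self_attention(frames, w_q, w_k, w_v, w_o, num_heads)
print(len(encoded), len(encoded[0]), len(attention))
```

Each head computes its attention weights independently of the others and of the position in the sequence, which is what allows the whole computation to be parallelized, in contrast to a recurrent network that must process frames one after another.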
Keywords/Search Tags: Video captioning, Multi-head attention, Multi-modal features, Semi-supervised learning