Video Captioning Algorithms Based On Multi-head Attention Mechanism

Posted on:2020-06-21

Degree:Master

Type:Thesis

Country:China

Candidate:M Chen

Full Text:PDF

GTID:2428330572967282

Subject:Information and Communication Engineering

Abstract/Summary:


The goal of video captioning is, given a video clip, to automatically generate descriptive text that corresponds to the video content. This thesis focuses on generating description text for short video clips. A short clip usually contains only one action or event, and the generated description is a single English sentence. Mainstream video captioning models use a recurrent neural network to learn the temporal dependencies within the video feature sequence and the word sequence, yielding vector representations of the video and text content. Because of the structure of recurrent neural networks, such models cannot be computed in parallel, and the temporal dependencies they learn are not flexible enough.

To improve computational speed and learn better temporal dependencies, we propose a video captioning baseline model based on the multi-head attention mechanism, which can be parallelized and obtains better vector representations of video and text content. In addition, at the data input level, because video data contains information from multiple modalities, we build on the baseline model to propose a multi-modal feature fusion video captioning model that adaptively controls how the different modal features contribute to word generation. The result is more natural description text that covers more of the details and presentation of the video content.

At the level of generalization and practicality, because existing video captioning datasets are small and cover a limited range of video types, we also build on the baseline model to propose a video captioning model based on semi-supervised learning. Supervised pre-training on short video data yields a generalized denoising encoder for video frame features, and the pre-trained model is used to improve the performance of the baseline model on the video captioning task. At the same time, a multi-task joint learning strategy is introduced: the video frame feature denoising task regularizes the video captioning task, further improving the generalization performance of the video frame feature encoder.
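
The parallelism argument for multi-head attention can be illustrated with a minimal NumPy sketch. It is not the thesis's model: the function name `multi_head_self_attention`, the random projection matrices, and the sizes (8 frames, 16-dimensional features, 4 heads) are all assumptions chosen for illustration. The point it shows is that every time step attends to every other time step through a few matrix products, with no step-by-step recurrence.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over a feature sequence.

    x: (seq_len, d_model) sequence of frame features or word embeddings.
    w_q, w_k, w_v, w_o: (d_model, d_model) projection matrices
        (hypothetical, randomly initialised in the demo below).
    Returns the encoded sequence (seq_len, d_model) and the attention
    weights (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # All pairwise time-step scores are computed at once for every head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    context = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights


# Illustrative sizes only: 8 frames, 16-dim features, 4 heads.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 8, 16, 4
frames = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4)
)
encoded, attn = multi_head_self_attention(frames, w_q, w_k, w_v, w_o, num_heads)
print(encoded.shape, attn.shape)
```

Each row of `attn[h]` sums to 1 and gives how strongly one time step draws on every other time step in head `h`, which is the sense in which the dependencies are learned without a fixed recurrent ordering.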

Keywords/Search Tags:

Video captioning, Multi-head attention, Multi-modal features, Semi-supervised learning


Related items

1	Research On Video Captioning Method Based On Multi-Head Attention
2	Research On Multi-Modal Video Captioning
3	Research And Application Of Video Captioning Technology Based On Deep Learning
4	Research On Multi-feature And Multi-modal Video Captioning Based On Deep Learning
5	Video Sentiment Analysis Based On Multimodal Fusion
6	Research On Multi-modal Learning For Imbalanced Modal Data
7	Research Of Video Captioning On Egocentric Videos
8	Research On Social Image Captioning Based On Deep Learning
9	Video Dense Event Description Text Generation Based On Multi-head Self-attention Mechanism
10	Cross-modal Retrieval And Annotation Based On Hashing Learning Method