
Research On Video Captioning Algorithm Based On Attention Mechanism

Posted on: 2022-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Zu
Full Text: PDF
GTID: 2518306725468984
Subject: Master of Engineering

Abstract/Summary:
With the advent of the information age, every Internet user is both a producer and a disseminator of online information. This situation is unprecedented: it is practically impossible to process and supervise the vast variety of video data on the Internet manually. There is therefore an urgent need to develop video captioning technology that can understand video content in place of human reviewers. To address the shortcomings of the current mainstream video captioning models, this paper proposes an attention-based video captioning model. The specific research content is as follows:

Because the structure of LSTM-based video captioning models prevents them from fully exploiting the non-visual modal information in a video, this paper proposes a video captioning model based on a visual-auditory dual-modal Transformer. Unlike an LSTM, the attention mechanism of the proposed model can form effective connections between two distant states. The model also introduces the auditory modal information of the video and constructs dual-modal video features, so that the description sentences it outputs carry more comprehensive video information. As a result, the quality of the captions, especially for audio-sensitive videos, is markedly improved. Furthermore, since the model's parallel structure accelerates training, it is more efficient than the traditional LSTM model.

Because video content is too complex to be fully described in a single sentence, a video event generation model is also designed. Inspired by the YOLO object detection model, a convolution kernel with adaptive receptive fields, derived with the K-Means algorithm, is designed for the convolution layer; this kernel detects video event sequences more accurately. The temporal positions of the detected event sequences then serve as input to the proposed captioning model to produce dense video captioning. Compared with single-sentence captioning, dense video captioning describes the video content more comprehensively and specifically, and the generated captions can also be used to locate events in the time domain for retrieval and search.
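To make the dual-modal design concrete, below is a minimal PyTorch sketch of a visual-auditory Transformer encoder. The module names, feature dimensions, and concatenation-based fusion are illustrative assumptions for this sketch, not the thesis's exact architecture.

import torch
import torch.nn as nn

class DualModalEncoder(nn.Module):
    """Sketch: fuse visual and auditory feature sequences with self-attention."""
    def __init__(self, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        # Separate projections map each modality into a shared embedding space.
        # The input dimensions (2048 visual, 128 audio) are assumed values.
        self.visual_proj = nn.Linear(2048, d_model)
        self.audio_proj = nn.Linear(128, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, visual_feats, audio_feats):
        # Concatenate both modal sequences along time so self-attention can
        # relate any visual state to any auditory state, however distant.
        tokens = torch.cat([self.visual_proj(visual_feats),
                            self.audio_proj(audio_feats)], dim=1)
        return self.encoder(tokens)

# Usage: 32 visual frames (2048-d) and 20 audio segments (128-d).
enc = DualModalEncoder()
memory = enc(torch.randn(1, 32, 2048), torch.randn(1, 20, 128))
print(memory.shape)  # torch.Size([1, 52, 512])

The encoder output would then feed a caption decoder; because every token can attend to every other in a single step, distant visual and auditory states are connected directly rather than through a long recurrent chain.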
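The adaptive receptive fields follow the same idea as YOLO's anchor-box clustering, transferred to the time axis. Below is a sketch of deriving kernel sizes by running K-Means over event durations; the duration data and the rounding rule are assumptions for illustration, while the thesis would cluster over its own training annotations.

import numpy as np
from sklearn.cluster import KMeans

# Assumed ground-truth event durations (in frames) from a training set.
durations = np.array([12, 15, 18, 40, 44, 52, 95, 110, 130], dtype=float)

# Cluster the durations into k groups; each cluster center becomes the
# temporal receptive field (kernel size) of one detection branch.
k = 3
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
    durations.reshape(-1, 1)).cluster_centers_.ravel()

# Round to odd kernel sizes so each 1-D convolution stays centered in time.
kernel_sizes = sorted(int(c) | 1 for c in centers)
print(kernel_sizes)  # e.g. [15, 45, 111] -- one receptive field per event scale

Matching the kernel sizes to the observed distribution of event lengths is what lets the convolution layer detect both short and long event sequences accurately.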
Keywords: Video captioning, attention mechanism, dual-modal characteristics, Transformer model