
Research On Video Captioning Algorithm Based On Attention Mechanism

Posted on: 2022-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Zu
Full Text: PDF
GTID: 2518306725468984
Subject: Master of Engineering

Abstract/Summary:
With the advent of the information age, every Internet user is both a producer and a disseminator of online information. This situation is unprecedented: it is practically impossible to process and supervise the vast variety of video data on the Internet manually. There is therefore an urgent need to develop video captioning technology that can understand video content in place of human reviewers. To address the shortcomings of the current mainstream video captioning models, this paper proposes an attention-based video captioning model. The specific research content is as follows:

Because the structure of LSTM-based video captioning models prevents them from fully exploiting the non-visual modal information in a video, this paper proposes a video captioning model based on a visual-auditory dual-modal Transformer. Unlike an LSTM, the attention mechanism of the proposed model can form effective connections between two distant states. The model also introduces the auditory modal information of the video and constructs dual-modal video features, so that the description sentences it outputs carry more comprehensive video information. As a result, the quality of the captions, especially for audio-sensitive videos, is markedly improved. Furthermore, since the model's parallel structure accelerates training, it is more efficient than the traditional LSTM model.

Because video content is too complex to be fully described in a single sentence, a video event generation model is also designed. Inspired by the YOLO object detection model, a convolution kernel with adaptive receptive fields, derived with the K-Means algorithm, is designed for the convolution layer; this kernel detects video event sequences more accurately. The temporal positions of the detected event sequences then serve as input to the proposed captioning model to produce dense video captioning. Compared with single-sentence captioning, dense video captioning describes the video content more comprehensively and specifically, and the generated captions can also be used to locate events in the time domain for retrieval and search.
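To make the dual-modal design concrete, below is a minimal PyTorch sketch of a visual-auditory Transformer encoder. The module names, feature dimensions, and concatenation-based fusion are illustrative assumptions for this sketch, not the thesis's exact architecture.

import torch
import torch.nn as nn

class DualModalEncoder(nn.Module):
    """Sketch: fuse visual and auditory feature sequences with self-attention."""
    def __init__(self, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        # Separate projections map each modality into a shared embedding space.
        # The input dimensions (2048 visual, 128 audio) are assumed values.
        self.visual_proj = nn.Linear(2048, d_model)
        self.audio_proj = nn.Linear(128, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, visual_feats, audio_feats):
        # Concatenate both modal sequences along time so self-attention can
        # relate any visual state to any auditory state, however distant.
        tokens = torch.cat([self.visual_proj(visual_feats),
                            self.audio_proj(audio_feats)], dim=1)
        return self.encoder(tokens)

# Usage: 32 visual frames (2048-d) and 20 audio segments (128-d).
enc = DualModalEncoder()
memory = enc(torch.randn(1, 32, 2048), torch.randn(1, 20, 128))
print(memory.shape)  # torch.Size([1, 52, 512])

The encoder output would then feed a caption decoder; because every token can attend to every other in a single step, distant visual and auditory states are connected directly rather than through a long recurrent chain.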
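The adaptive receptive fields follow the same idea as YOLO's anchor-box clustering, transferred to the time axis. Below is a sketch of deriving kernel sizes by running K-Means over event durations; the duration data and the rounding rule are assumptions for illustration, while the thesis would cluster over its own training annotations.

import numpy as np
from sklearn.cluster import KMeans

# Assumed ground-truth event durations (in frames) from a training set.
durations = np.array([12, 15, 18, 40, 44, 52, 95, 110, 130], dtype=float)

# Cluster the durations into k groups; each cluster center becomes the
# temporal receptive field (kernel size) of one detection branch.
k = 3
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
    durations.reshape(-1, 1)).cluster_centers_.ravel()

# Round to odd kernel sizes so each 1-D convolution stays centered in time.
kernel_sizes = sorted(int(c) | 1 for c in centers)
print(kernel_sizes)  # e.g. [15, 45, 111] -- one receptive field per event scale

Matching the kernel sizes to the observed distribution of event lengths is what lets the convolution layer detect both short and long event sequences accurately.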
Keywords: Video captioning, attention mechanism, dual-modal characteristics, Transformer model