
Research On Video Captioning Methods Based On Encoder-decoder Structure

Posted on: 2024-01-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: T Z Niu    Full Text: PDF
GTID: 1528307202961169    Subject: Software engineering
Abstract/Summary:
Video captioning has become an active and thriving area of artificial intelligence. The task requires not only accurate analysis and understanding of video content, but also the generation of fluent, accurate, and coherent descriptions. Because video carries richer and more complex spatio-temporal information than images, and because of the inherent gap between vision and natural language, video captioning remains a challenging task. With the development of deep neural networks in recent years, encoder-decoder frameworks have gradually become the mainstream approach to video captioning. Despite recent promising achievements in this area, many problems remain to be solved.

1) In encoder design, existing methods still adopt a simplistic scheme for handling diverse types of features. This may suffice for videos with a relatively homogeneous and easily identifiable scene, but when the content is more complex, contains more than one object, or is harder to recognize, more fine-grained processing of visual features is required. 2) In decoder design, current RNN-based decoders typically use only one or two layers of LSTMs or GRUs, and the information transfer between layers is limited; videos that require complex linguistic structures to be described call for a multi-layer decoder that captures linguistic information at different levels. 3) In the overall model design, it is equally worthwhile to investigate how to improve the consistency of visual and linguistic features on the encoder side and how to use future-word information on the decoder side. 4) Although various offline feature extractors can provide information from different perspectives during video encoding, their fixed parameters impose limitations: they are pre-trained only on image/video comprehension tasks, which makes it difficult for them to adapt to video captioning datasets. Furthermore, most current encoder-decoder captioning models tend to ignore shallow visual and textual information.

To address the above problems, this thesis conducts in-depth research on video captioning methods and designs several effective models. The main work of this thesis includes the following.

(1) To address the encoder problems, we present a novel video captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion (SEMF for short), which optimizes the utilization of various features through three distinct modules.

(2) To address the decoder problems, we propose a Multi-layer memory sharing Network (MesNet for short), which allows more layers to be stacked without compromising performance. In MesNet, we construct a novel memory-sharing structure to strengthen the connections between layers and make the model easier to train.
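Purely as an illustration of the general idea of a multi-layer decoder whose layers share a memory state, the PyTorch sketch below stacks GRU cells that all read from, and jointly write to, one shared memory vector. The module names, dimensions, and update rule are assumptions for illustration only, not the MesNet design specified in the thesis.

```python
import torch
import torch.nn as nn

class MemorySharingDecoder(nn.Module):
    """Toy multi-layer GRU decoder: every layer is conditioned on a shared
    memory vector that all layers update together at each step.
    Illustrative only; not the MesNet architecture from the thesis."""

    def __init__(self, vocab_size, dim=512, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.layers = nn.ModuleList([nn.GRUCell(dim, dim) for _ in range(num_layers)])
        # All layer states contribute to one shared memory reused at the next step.
        self.write_memory = nn.Linear(dim * num_layers, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, word_ids, visual_feat):
        # word_ids: (batch, seq_len) token ids; visual_feat: (batch, dim) pooled video feature.
        seq_len = word_ids.size(1)
        hidden = [visual_feat.clone() for _ in self.layers]  # init every layer from the video
        memory = visual_feat.clone()                         # shared memory across layers
        logits = []
        for t in range(seq_len):
            x = self.embed(word_ids[:, t])
            for i, cell in enumerate(self.layers):
                inp = x if i == 0 else hidden[i - 1]
                # Each layer sees the shared memory, not only the layer below it,
                # which shortens the paths between distant layers.
                hidden[i] = cell(inp + memory, hidden[i])
            memory = torch.tanh(self.write_memory(torch.cat(hidden, dim=-1)))
            logits.append(self.out(hidden[-1]))
        return torch.stack(logits, dim=1)  # (batch, seq_len, vocab_size)
```

In this toy version the shared memory is simply a learned summary of all layer states from the previous step; the point it illustrates is only why such cross-layer sharing can make deeper decoder stacks easier to train than a plain layer-by-layer stack.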
(3) To address how to improve the consistency of visual and linguistic features in the encoder and how to use future-word information in the decoder, we propose a novel video captioning model that learns from the gLobal sEntence and looks AheaD (LEAD for short). Its Vision Module (VM) is a novel attention network that maps visual features into a high-level language space and improves the consistency of visual and linguistic features; its Language Module (LM) not only makes effective use of the information of the preceding sequence when generating the current word, but also looks ahead at future words. In addition, we propose an autonomous strategy and a multi-stage training scheme to optimize the model.

(4) To address how to achieve end-to-end training of the encoder-decoder structure, we propose a novel End-to-end Video Captioning network with Multi-scale Features (EVC-MF), which can efficiently utilize multi-scale features to generate video descriptions. Raw video frames are fed directly into a transformer-based network to obtain multi-scale visual features; a mask encoder and an enhanced transformer-based decoder then efficiently exploit the multi-scale visual information together with shallow textual information.
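As a rough, hedged illustration of the end-to-end idea only (training the visual backbone jointly with the caption decoder on raw frames, and letting the decoder attend to more than one feature scale), the sketch below uses a toy convolutional backbone and a standard transformer decoder. Every layer size and module name is an assumption for illustration; this is not the EVC-MF architecture described in the thesis.

```python
import torch
import torch.nn as nn

class EndToEndCaptioner(nn.Module):
    """Toy end-to-end captioner: a trainable backbone replaces a frozen offline
    feature extractor, and the decoder attends to two feature scales.
    Illustrative only; not the EVC-MF design from the thesis."""

    def __init__(self, vocab_size, dim=512):
        super().__init__()
        # Tiny trainable backbone producing two "scales" of features per frame.
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU())
        self.stage1 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
        self.proj1, self.proj2 = nn.Linear(128, dim), nn.Linear(256, dim)
        self.embed = nn.Embedding(vocab_size, dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (batch, num_frames, 3, H, W); captions: (batch, seq_len) token ids.
        b, t = frames.shape[:2]
        x = self.stem(frames.flatten(0, 1))   # gradients flow back into the backbone
        f1 = self.stage1(x)                   # coarser scale
        f2 = self.stage2(f1)                  # deeper, more abstract scale
        # Pool each scale, project to a common width, and concatenate along the
        # token axis so the decoder can attend to both scales at once.
        tok1 = self.proj1(f1.mean(dim=(2, 3))).view(b, t, -1)
        tok2 = self.proj2(f2.mean(dim=(2, 3))).view(b, t, -1)
        visual_tokens = torch.cat([tok1, tok2], dim=1)        # (b, 2*t, dim)
        tgt = self.embed(captions)
        seq_len = captions.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                       device=captions.device), diagonal=1)
        hidden = self.decoder(tgt, visual_tokens, tgt_mask=causal)
        return self.out(hidden)                                # (b, seq_len, vocab_size)
```

Because the backbone sits inside the computation graph, its parameters receive gradients from the captioning loss; this joint adaptation to the captioning data is the essential difference from pipelines built on frozen offline feature extractors.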
Keywords/Search Tags:Video captioning, Encoder-decoder structure, Visual feature optimization, Multilayer decoder, Semantic consistency, End-to-end training