
Research And Application Of Video Captioning Technology Based On Deep Learning

Posted on: 2020-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Wei
Full Text: PDF
GTID: 2518306500987099
Subject: Software engineering
Abstract/Summary:
Video captioning is a technique that uses computer models to generate text descriptions for specific targets and scenes in a video. It requires understanding objects, people, scenes, events, temporal relationships, and many other aspects. Attention-based encoder-decoder models have shown great success on video captioning. Recent multi-modal video captioning work has mainly applied the attention mechanism to all modalities and fused them at a single level; however, the connections among specific modalities have not been investigated in the fusion process. In this thesis, the expressivity of each uni-modal feature is investigated first. Considering the characteristics of the attention mechanism, instance-level visual content is exploited to refine the temporal features. A CNN+RNN semantic detection architecture is then applied to the spatio-temporal content to exploit the correlations between semantic labels for a better video semantic representation. Finally, a hierarchical attention-based multi-modal fusion model for video captioning is proposed by jointly considering the intrinsic properties of the multi-modal features. Experimental results on the MSVD and MSR-VTT datasets show that the proposed method achieves the best performance among state-of-the-art video captioning methods.

First, this thesis trains a fully convolutional semantic segmentation network to label the target regions in the region-proposal (RPN) features, finds the corresponding regions in the target feature map through a one-to-one coordinate mapping, and combines them to generate the final feature matrix (see the sketch after this abstract). This visual feature extraction method not only removes redundant background features but also compensates for the way attention-based methods partition the feature map without regard to object integrity. Experiments show that it outperforms the traditional method by 1.5%-2% on several evaluation metrics.

Second, the thesis uses a CNN+RNN framework to perform multi-label extraction from videos (sketched below). Through the temporal sensitivity of the recurrent neural network (RNN), the model effectively learns the connections between different labels in a video and improves the quality of the generated high-level semantic information. Experiments show that the high-level semantic labels learned by the CNN+RNN framework improve the model's performance by 1%-2%.

Third, building on the traditional attention mechanism and multi-modal fusion techniques, a modal fusion method based on hierarchical attention is proposed (sketched below). By organizing the modal information into layers, it overcomes the inability of conventional single-level fusion to distinguish the differences and connections among the modalities, thereby improving caption quality. Experiments show that the hierarchical attention-based fusion generates more accurate video captions, outperforming the traditional fusion method by 1%-1.5% on the BLEU and CIDEr metrics.

Besides the above algorithmic innovations, the following engineering tasks were completed. First, sufficient training and validation samples were prepared through data collection and curation. Second, the algorithm models proposed in this thesis were implemented with a machine learning framework. Then, the training samples were fed into the complete model for training and tuning. Finally, the curated validation samples were fed into the trained model for performance testing.
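The following minimal PyTorch sketch illustrates the first contribution under stated assumptions: an FCN-style mask head gates the backbone features to suppress background, and each RPN proposal is cropped out of the gated map through an image-to-feature-map coordinate mapping. The class name SegmentationGate, the 1x1-conv mask head, and the fixed stride are illustrative assumptions, not the thesis's actual implementation.

```python
# Hypothetical sketch: masking RPN region features with an FCN-style
# foreground mask, then cropping proposals via coordinate mapping.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class SegmentationGate(nn.Module):
    def __init__(self, in_channels: int, stride: int = 16):
        super().__init__()
        self.stride = stride  # backbone downsampling factor (assumed)
        # 1x1 conv head predicting a per-pixel foreground probability
        self.mask_head = nn.Sequential(
            nn.Conv2d(in_channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor):
        """feats: (C, H, W) backbone feature map of one frame.
        boxes: (N, 4) RPN proposals in image coords (x1, y1, x2, y2).
        Returns a list of N background-suppressed region features."""
        mask = self.mask_head(feats.unsqueeze(0))[0, 0]  # (H, W)
        gated = feats * mask                             # suppress background
        regions = []
        for x1, y1, x2, y2 in boxes.tolist():
            # one-to-one coordinate mapping: image -> feature map
            fx1, fy1 = int(x1 // self.stride), int(y1 // self.stride)
            fx2, fy2 = int(x2 // self.stride) + 1, int(y2 // self.stride) + 1
            regions.append(gated[:, fy1:fy2, fx1:fx2])
        return regions
```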
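The CNN+RNN multi-label detector of the second contribution can be sketched in the same spirit: pre-extracted per-frame CNN features are run through a GRU whose recurrent state carries correlations among labels across time, and a linear head trained with binary cross-entropy emits per-label scores. The feature dimension, the GRU choice, and the label count are assumptions.

```python
# Hypothetical CNN+RNN multi-label semantic detector: the recurrent
# state aggregates frame features over time so that correlations
# among labels can be learned; dimensions are assumed for the sketch.
import torch
import torch.nn as nn

class SemanticDetector(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_labels=300):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T, feat_dim) pre-extracted CNN features
        _, h_n = self.rnn(frame_feats)       # h_n: (1, batch, hidden)
        return self.classifier(h_n[-1])      # logits: (batch, num_labels)

# Multi-label training uses a binary cross-entropy objective:
model = SemanticDetector()
feats = torch.randn(4, 16, 2048)                # 4 clips, 16 frames each
labels = torch.randint(0, 2, (4, 300)).float()  # multi-hot ground truth
loss = nn.BCEWithLogitsLoss()(model(feats), labels)
```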
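Finally, a minimal sketch of hierarchical attention fusion, assuming a two-level design: a first attention summarizes each modality over time conditioned on the decoder state, and a second attention weighs the per-modality summaries, so related modalities are not flattened into a single fusion level. All module names and dimensions are hypothetical.

```python
# Hypothetical two-level ("hierarchical") attention fusion:
# level 1 attends within each modality over time; level 2 attends
# across the per-modality summaries. Names/dims are assumptions.
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim, query_dim):
        super().__init__()
        self.score = nn.Linear(dim + query_dim, 1)

    def forward(self, items: torch.Tensor, query: torch.Tensor):
        # items: (N, dim); query: (query_dim,)
        q = query.expand(items.size(0), -1)
        w = torch.softmax(self.score(torch.cat([items, q], -1)), dim=0)
        return (w * items).sum(0)  # weighted sum: (dim,)

class HierarchicalFusion(nn.Module):
    def __init__(self, dim=512, state_dim=512, num_modalities=3):
        super().__init__()
        self.temporal = nn.ModuleList(
            [Attention(dim, state_dim) for _ in range(num_modalities)])
        self.modal = Attention(dim, state_dim)

    def forward(self, modalities, decoder_state):
        # modalities: list of (T_m, dim) feature sequences
        summaries = torch.stack([
            attn(feats, decoder_state)
            for attn, feats in zip(self.temporal, modalities)])
        return self.modal(summaries, decoder_state)  # fused (dim,)

fusion = HierarchicalFusion()
mods = [torch.randn(26, 512), torch.randn(26, 512), torch.randn(8, 512)]
context = fusion(mods, torch.randn(512))  # context vector for the decoder
```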
Keywords/Search Tags: Video captioning, Multi-modal, Attention mechanism, RPN