With the rapid advancement of multimedia and network technology, large amounts of multimedia data such as images, videos, texts, and audio are emerging and growing rapidly, giving data a multi-modal character. Videos have become a primary carrier of information, and understanding video content is important for video retrieval, video recommendation systems, and public opinion monitoring. Video captioning is the task of generating descriptions for a video by understanding the actions and events that occur in it. It is a typical cross-modal vision-language task whose goal is to convey video information by automatically generating corresponding natural language sentences.

The core of the video captioning task is to process the cross-modal data in a video, namely visual and textual information, and to deeply mine the semantic information contained in the different modalities. The task poses two main difficulties. On the one hand, the two modalities must be semantically consistent: visual and textual information need to be aligned at the semantic level. On the other hand, the textual description of the visual content must be not only correct but also coherent. Addressing these two difficulties, this thesis studies the video captioning task based on visual semantic information from two perspectives. The first is to use self-supervised learning to model the temporal sequence of video frames and the accompanying text, and to obtain more accurate captions through multi-modal information fusion and multi-level semantic alignment. The second is to introduce specific semantic information as external knowledge to improve the quality of the generated descriptions. This thesis proposes two video captioning algorithms; the main contents and contributions are as follows:

1. To meet the requirement of semantic consistency in the video captioning task, a video-language representation learning model with a trilinear structure is proposed. The model uses automatically extracted dense captions as supplementary text to the original ASR transcript, and a modal fusion encoder with a trilinear structure realizes the interaction and fusion of information between the video and language modalities (see the first sketch after the contributions). At the same time, intra-modal dependencies are modeled to achieve semantic alignment during the video-text multi-modal interaction. The proposed trilinear model provides better multi-modal representations for video captioning and thereby improves the quality of the generated descriptions. Experimental results on multiple public datasets confirm the superiority of this model.

2. To fully mine the visual semantic clues that help understand video content, a video captioning model with a scenario-aware recurrent network structure is proposed. The model introduces a scenario prediction module that predicts the context of the current video clip and uses this scenario semantic information as a local supervision signal to guide the model toward descriptions that match the current scene (see the second sketch below). At the same time, a memory unit with a recurrent structure models historical information so that long, multi-sentence video descriptions remain coherent and accurate, thereby improving the performance of the video captioning task. Experiments show that this model generates high-quality video descriptions, which also confirms its superiority.
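To make the first contribution more concrete, the following is a minimal PyTorch sketch of a three-stream fusion encoder in the spirit described above: video frames, the ASR transcript, and dense captions attend to one another, and each stream also models its own intra-modal dependencies. It is an illustrative sketch only, not the thesis's actual implementation; the class name, feature dimensions, and the use of standard multi-head attention for both cross- and intra-modal modeling are assumptions.

```python
import torch
import torch.nn as nn

class TrilinearFusionEncoder(nn.Module):
    """Illustrative three-stream fusion (name and sizes are placeholders):
    video, ASR, and dense-caption features attend to one another, then each
    stream models its intra-modal dependencies with self-attention."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # one cross-attention block per query stream
        self.cross_attn = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, heads, batch_first=True)
            for name in ["video", "asr", "dense"]
        })
        # intra-modal self-attention for alignment within a stream
        self.self_attn = nn.ModuleDict({
            name: nn.MultiheadAttention(dim, heads, batch_first=True)
            for name in ["video", "asr", "dense"]
        })
        self.norm = nn.LayerNorm(dim)

    def forward(self, video, asr, dense):
        streams = {"video": video, "asr": asr, "dense": dense}
        fused = {}
        for name, query in streams.items():
            # the other two streams serve as the cross-modal context
            context = torch.cat(
                [v for k, v in streams.items() if k != name], dim=1)
            x, _ = self.cross_attn[name](query, context, context)
            x = self.norm(x + query)              # residual + norm
            y, _ = self.self_attn[name](x, x, x)  # intra-modal dependencies
            fused[name] = self.norm(y + x)
        # pooled joint representation for a downstream caption decoder
        return torch.cat([f.mean(dim=1) for f in fused.values()], dim=-1)

# toy usage: 2 clips, 20 frames / 30 ASR tokens / 12 dense-caption tokens
enc = TrilinearFusionEncoder()
rep = enc(torch.randn(2, 20, 512), torch.randn(2, 30, 512), torch.randn(2, 12, 512))
print(rep.shape)  # torch.Size([2, 1536])
```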
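Similarly, the second contribution can be illustrated with a minimal sketch of a scenario-aware recurrent decoder: a scenario head predicts the context of each clip, and a recurrent memory cell carries history across clips so that consecutive sentences stay coherent. This is again only a sketch under assumed details; the class name, the number of scenario classes, the vocabulary size, the hard argmax over scenarios, and the GRU-based memory cell are illustrative placeholders rather than the thesis's design.

```python
import torch
import torch.nn as nn

class ScenarioAwareDecoder(nn.Module):
    """Illustrative scenario-aware recurrent structure (sizes are placeholders):
    a scenario head predicts the clip's context, and a GRU memory cell models
    history across clips before a word head scores the vocabulary."""

    def __init__(self, feat_dim=512, hidden=512, num_scenarios=20, vocab=10000):
        super().__init__()
        self.scenario_head = nn.Linear(feat_dim, num_scenarios)  # local supervision signal
        self.scenario_emb = nn.Embedding(num_scenarios, hidden)
        self.memory = nn.GRUCell(feat_dim + hidden, hidden)      # cross-clip memory unit
        self.word_head = nn.Linear(hidden, vocab)

    def forward(self, clip_feats, memory_state=None):
        # clip_feats: (batch, num_clips, feat_dim) pooled features per clip
        batch, num_clips, _ = clip_feats.shape
        if memory_state is None:
            memory_state = clip_feats.new_zeros(batch, self.memory.hidden_size)
        scenario_logits, word_logits = [], []
        for t in range(num_clips):
            feat = clip_feats[:, t]
            logits = self.scenario_head(feat)                    # predict the clip scenario
            scenario = self.scenario_emb(logits.argmax(dim=-1))  # hard choice, for the sketch only
            memory_state = self.memory(
                torch.cat([feat, scenario], dim=-1), memory_state)
            scenario_logits.append(logits)
            word_logits.append(self.word_head(memory_state))     # stand-in for a full caption head
        return torch.stack(scenario_logits, 1), torch.stack(word_logits, 1), memory_state

# toy usage: 2 videos, 4 clips each
dec = ScenarioAwareDecoder()
s, w, m = dec(torch.randn(2, 4, 512))
print(s.shape, w.shape)  # torch.Size([2, 4, 20]) torch.Size([2, 4, 10000])
```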