With the rapid development of the Internet and multimedia technology, video data has exploded and now occupies an important position in massive multimedia collections. Given a video, most people can easily extract a great deal of information from it and explain or describe its content to varying degrees. For machines, however, extracting information from videos and generating sentence descriptions remains very challenging. In recent years, with the rise of deep learning, video captioning has attracted more and more researchers in the fields of language and vision. Although significant progress has been made in video captioning methods, the task still faces many challenges due to the inherent multimodal nature of videos and the semantic gap between different modalities: (1) To generate accurate and comprehensive sentences, features such as the appearance, motion, and audio of videos are very important; however, most existing methods simply concatenate different types of features and ignore the interactions between them. (2) There is a huge semantic gap between the visual feature space and the semantic embedding space, making it difficult to explore the correlation and compatibility between them, which makes video captioning even more difficult. (3) Most methods only consider the visual and textual modalities of a video while ignoring the audio modality, leaving the model insensitive to audio-related scenes.

Motivated by the above observations, the author proposes two video captioning methods: one based on semantic embedding guided attention (SEGA) with explicit visual feature fusion (EVF), and one based on semantic embedding guided attention with explicit visual-audio feature fusion (EVAF). Explicit visual feature fusion (EVF) and explicit visual-audio feature fusion (EVAF) can be collectively referred to as explicit feature fusion (EF). Firstly, the author designs an explicit visual feature fusion scheme to capture the pairwise interactions between the feature dimensions of multiple visual modalities and to fuse the multimodal visual features of a video in an explicit way.
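To make the pairwise-interaction idea concrete, the following is a minimal PyTorch sketch of a bilinear-style fusion of two visual feature vectors (e.g., appearance features from a 2D CNN and motion features from a 3D CNN). The dimensions, the low-rank projection, and the class name are illustrative assumptions, not the exact EVF formulation of the thesis.

```python
import torch
import torch.nn as nn

class PairwiseFeatureFusion(nn.Module):
    """Fuses two feature vectors by modelling all pairwise interactions
    between their dimensions via a low-rank outer product.
    Dimensions and design choices here are hypothetical."""

    def __init__(self, dim_a=2048, dim_b=2048, rank=128, out_dim=1024):
        super().__init__()
        # Project each modality into a shared low-rank space so the
        # outer product stays tractable.
        self.proj_a = nn.Linear(dim_a, rank)
        self.proj_b = nn.Linear(dim_b, rank)
        self.out = nn.Linear(rank * rank, out_dim)

    def forward(self, feat_a, feat_b):
        # feat_a: (batch, dim_a) appearance features (e.g., 2D CNN)
        # feat_b: (batch, dim_b) motion features (e.g., 3D CNN)
        a = torch.tanh(self.proj_a(feat_a))          # (batch, rank)
        b = torch.tanh(self.proj_b(feat_b))          # (batch, rank)
        # Outer product: every dimension of `a` interacts with every
        # dimension of `b`, i.e., explicit pairwise interactions.
        pairwise = torch.einsum('bi,bj->bij', a, b)  # (batch, rank, rank)
        fused = self.out(pairwise.flatten(1))        # (batch, out_dim)
        return fused
```

Under the same assumptions, an audio feature vector could be fused with the visual result in an analogous pairwise fashion, which is the spirit of the visual-audio variant (EVAF).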
Secondly, a novel attention mechanism called semantic embedding guided attention (SEGA) is proposed, which is combined with traditional temporal attention to form a cooperative attention module responsible for generating a joint attention map. Specifically, in SEGA, semantic word embedding information is leveraged to guide the model to pay more attention to the most correlated visual features at each decoding step; in this way, the semantic gap between the visual and semantic spaces is alleviated to some extent (a sketch of this cooperative attention is given below, after the contribution list). Furthermore, building on the above, the author proposes a second video captioning model that is able to extract and leverage audio context. Specifically, with an explicit visual-audio feature fusion module, the method explicitly fuses the visual context and the audio context to further explore the interactions between the visual and audio modalities.

To evaluate and validate the proposed models, the author conducted extensive comparative and ablation experiments on two widely used datasets, i.e., MSVD and MSR-VTT.

For video captioning, the main contributions of this thesis are summarized as follows: (1) The author proposes an explicit feature fusion (EF) scheme, including explicit visual feature fusion (EVF) and explicit visual-audio feature fusion (EVAF), to model the pairwise interactions between different types of features and fuse them into a single feature vector explicitly. (2) The author proposes a novel attention mechanism called semantic embedding guided attention (SEGA), which computes attention weights by exploiting semantic word embedding information; cooperating with temporal attention, it generates a more meaningful joint attention map. (3) Extensive experiments are conducted on the MSVD and MSR-VTT datasets. The results demonstrate that the proposed approach achieves state-of-the-art performance, and extensive ablation studies illustrate the effectiveness of the proposed mechanisms, i.e., EF and SEGA.
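The cooperative attention described above can be sketched as follows: temporal attention is driven by the decoder hidden state, SEGA is driven by the embedding of the previously generated word, and the two attention distributions are combined into a joint attention map over frames. The dimensions, the multiplicative combination rule, and the renormalization are assumptions for illustration; the thesis' exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CooperativeAttention(nn.Module):
    """Combines temporal attention (query: decoder hidden state) with
    semantic-embedding-guided attention (query: previous word embedding)
    into one joint attention map. Dimensions and the combination rule
    are hypothetical."""

    def __init__(self, feat_dim=1024, hid_dim=512, emb_dim=300, att_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)   # temporal attention query
        self.emb_proj = nn.Linear(emb_dim, att_dim)   # SEGA query
        self.score_t = nn.Linear(att_dim, 1)
        self.score_s = nn.Linear(att_dim, 1)

    def forward(self, frame_feats, hidden, word_emb):
        # frame_feats: (batch, num_frames, feat_dim) fused visual features
        # hidden:      (batch, hid_dim) decoder state at the current step
        # word_emb:    (batch, emb_dim) embedding of the previous word
        f = self.feat_proj(frame_feats)                                   # (B, T, A)
        temporal = self.score_t(
            torch.tanh(f + self.hid_proj(hidden).unsqueeze(1))).squeeze(-1)   # (B, T)
        semantic = self.score_s(
            torch.tanh(f + self.emb_proj(word_emb).unsqueeze(1))).squeeze(-1)  # (B, T)
        # Joint attention map: emphasize frames that both cues point to.
        joint = F.softmax(temporal, dim=1) * F.softmax(semantic, dim=1)
        joint = joint / (joint.sum(dim=1, keepdim=True) + 1e-8)          # renormalize
        context = torch.bmm(joint.unsqueeze(1), frame_feats).squeeze(1)  # (B, feat_dim)
        return context, joint
```

At each decoding step, the returned context vector would be fed to the caption decoder together with the word embedding, so that the generated word is conditioned on the frames selected by the joint attention map.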