With the explosive growth in the number of short videos, handling this huge volume of video effectively has become an urgent problem. Video captioning aims to use machines to generate natural language descriptions of an input video and is one of the common means of processing video. It involves both computer vision and natural language processing, and has important applications in areas such as human-computer interaction, content-based video retrieval, and intelligent driving. Current video captioning methods suffer from poor correlation between visual and text features, inadequate fusion of multimodal features, and over-reliance on semantic features. This paper studies these issues in depth, with the following main work and innovations:

(1) To address the weak correlation between visual features and text, this paper proposes a video captioning method based on Multilinear Spatio-Temporal Correlation and Feature Reconstruction (MSTC-FR), which designs a multilinear spatio-temporal correlation module and a feature reconstruction module to strengthen the correlation between visual and textual features. The multilinear spatio-temporal correlation module first generates a joint bilinear representation of the input features and the query values using bilinear pooling, then computes attention distributions along the temporal and spatial dimensions respectively, and finally uses a hierarchical fusion mechanism to dynamically fuse the extracted visual and non-visual features, selecting the features most relevant to the text. The feature reconstruction module uses the linguistic features generated by the decoder to reconstruct the video features, reducing the semantic deviation of the reconstructed features from the visual features extracted by the encoder and further optimising model performance.

(2) To address the inadequate fusion of multimodal features, this paper designs the Multimodal Feature Fusion Network (MFF-Net) to achieve effective fusion of video appearance features, motion features, and object features. Multimodal feature fusion is guided by the appearance features, which carry the richest video content. First, the features of the different modalities are mapped into a common space; then a correlation matrix is constructed to filter the content-related information in the motion and object features; finally, the filtered information is fused with the appearance features to generate multimodal fusion features. By using the correlation matrix to filter out irrelevant information from the motion and object features, the network ensures that the generated features contain the content information most relevant to the video.

(3) To address the over-reliance on semantic features, this paper proposes the Temporal Semantic Aggregation Network (TSA-Net) to alleviate the dependence on semantic information. Temporal semantic aggregation comprises temporal aggregation and semantic aggregation: temporal aggregation gathers related temporal information along the temporal dimension of the features, while semantic aggregation gathers related semantic information along the spatial dimension. By aggregating both temporal and semantic information within the features, the network reduces the dependence incurred when semantic features are used directly, and aggregating the most relevant features helps the decoder generate more accurate descriptions.
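To make the attention mechanism in (1) more concrete, the following is a minimal PyTorch-style sketch of a spatio-temporal attention built on low-rank bilinear pooling, with a gated fusion of the temporal and spatial contexts standing in for the hierarchical fusion step. The class name, layer shapes, and the mean-pooling choices are illustrative assumptions, not the exact architecture of MSTC-FR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalBilinearAttention(nn.Module):
    """Illustrative sketch: joint bilinear representation of visual features
    and a decoder query, followed by separate temporal and spatial attention
    and a learned (gated) fusion of the two attended context vectors."""

    def __init__(self, feat_dim, query_dim, hidden_dim):
        super().__init__()
        # low-rank bilinear pooling: project both inputs, combine by Hadamard product
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)
        self.proj_query = nn.Linear(query_dim, hidden_dim)
        self.score_t = nn.Linear(hidden_dim, 1)   # temporal attention scores
        self.score_s = nn.Linear(hidden_dim, 1)   # spatial attention scores
        self.gate = nn.Linear(feat_dim * 2, 1)    # fuses temporal/spatial contexts

    def forward(self, feats, query):
        # feats: (B, T, S, D) features with T frames and S spatial regions per frame
        # query: (B, Dq) current decoder state
        joint = torch.tanh(self.proj_feat(feats)) * \
                torch.tanh(self.proj_query(query)).unsqueeze(1).unsqueeze(2)  # (B, T, S, H)

        # temporal attention: score each frame from its spatially pooled joint representation
        alpha_t = F.softmax(self.score_t(joint.mean(dim=2)), dim=1)      # (B, T, 1)
        temporal_ctx = (feats.mean(dim=2) * alpha_t).sum(dim=1)          # (B, D)

        # spatial attention: score each region from its temporally pooled joint representation
        alpha_s = F.softmax(self.score_s(joint.mean(dim=1)), dim=1)      # (B, S, 1)
        spatial_ctx = (feats.mean(dim=1) * alpha_s).sum(dim=1)           # (B, D)

        # gated fusion of the two context vectors
        g = torch.sigmoid(self.gate(torch.cat([temporal_ctx, spatial_ctx], dim=-1)))
        return g * temporal_ctx + (1 - g) * spatial_ctx
```

Given features of shape (batch, frames, regions, dim) and a decoder state, the module returns a single context vector conditioned on the current word being generated, which is the role the spatio-temporal correlation module plays in the caption decoder.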
The proposed methods are evaluated on MSVD and MSR-VTT, two widely used datasets for the video captioning task. Experimental results demonstrate the superiority of the proposed methods over representative video captioning methods.
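Similarly, the correlation-matrix filtering in (2) can be illustrated by the sketch below, which projects appearance, motion, and object features into a common space, builds appearance-guided correlation matrices, and fuses the re-weighted motion and object information back into the appearance stream. All module names and dimensions here are assumptions for illustration, not the exact MFF-Net design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationGuidedFusion(nn.Module):
    """Illustrative sketch: map appearance, motion and object features into a
    common space, use correlation matrices between appearance and the other two
    modalities to keep only content-related information, then fuse the filtered
    features with the guiding appearance features."""

    def __init__(self, app_dim, mot_dim, obj_dim, common_dim):
        super().__init__()
        self.app_proj = nn.Linear(app_dim, common_dim)
        self.mot_proj = nn.Linear(mot_dim, common_dim)
        self.obj_proj = nn.Linear(obj_dim, common_dim)
        self.out_proj = nn.Linear(common_dim * 3, common_dim)

    def forward(self, app, mot, obj):
        # app: (B, T, Da) appearance, mot: (B, T, Dm) motion, obj: (B, N, Do) objects
        a = self.app_proj(app)   # (B, T, C)
        m = self.mot_proj(mot)   # (B, T, C)
        o = self.obj_proj(obj)   # (B, N, C)

        # correlation matrices between appearance and the other modalities
        corr_am = F.softmax(torch.bmm(a, m.transpose(1, 2)), dim=-1)  # (B, T, T)
        corr_ao = F.softmax(torch.bmm(a, o.transpose(1, 2)), dim=-1)  # (B, T, N)

        # keep only the motion/object information that correlates with the appearance content
        mot_filtered = torch.bmm(corr_am, m)   # (B, T, C)
        obj_filtered = torch.bmm(corr_ao, o)   # (B, T, C)

        # fuse the filtered modalities with the guiding appearance features
        return self.out_proj(torch.cat([a, mot_filtered, obj_filtered], dim=-1))
```

The softmax-normalised correlation matrices act as the filter described in (2): motion and object entries that have little correlation with the appearance content receive low weights and contribute little to the fused multimodal features.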