With the continuous development of short-video platforms, the demand for video processing keeps growing. Video captioning is an effective method for understanding video content: it automatically generates a natural-language description based on a full understanding of the video, enabling intelligent analysis of video content. Video captioning can be applied to video retrieval, assistance for visually impaired people, video surveillance, and other fields. Because the task involves information from both the visual and textual modalities, mainstream video captioning methods adopt an encoder-decoder framework to bridge the semantic gap between the two modalities and improve the model's ability to describe video content. However, the sentences produced by existing video captioning models are still often inaccurate and lacking in detail.

In this paper, a Visual Semantic Enhanced Encoder is constructed, consisting of a visual-semantic embedding module and a multimodal feature fusion module. The visual-semantic embedding module mines the semantic information contained in the static and temporal features of the video, while the multimodal feature fusion module captures the high-level interactions between the two kinds of features. The two modules complement each other and help the encoder generate more powerful feature representations.

Since visual and textual information belong to different modalities and lie in different feature spaces, it is challenging to align the two kinds of features in a common space and then convert visual information into text. Therefore, this paper designs a Visual-Guided Decoder that explicitly aligns the two feature types. The decoder consists of two parts: a visual decision module and a dependency controller. The visual decision module first introduces the visual information most relevant to the word being predicted; the dependency controller then adaptively controls the contributions of visual and textual information during word generation, preventing the decoder from relying too heavily on superficial correlations between words (i.e., language priors) and producing descriptions unrelated to the video content.

Finally, experiments are conducted on two public video captioning datasets, MSVD and MSR-VTT. Compared with mainstream methods, the description sentences generated by the proposed method are more fluent and accurate, and better reflect the content of the video.
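To make the encoder description concrete, the following PyTorch sketch illustrates the two encoder modules. It assumes linear projections for the visual-semantic embedding and cross-attention for the multimodal fusion; all class names, feature dimensions, and design choices here are hypothetical illustrations, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class VisualSemanticEmbedding(nn.Module):
    """Projects static (frame-level) and temporal (clip-level) features
    into a shared semantic space (hypothetical design)."""

    def __init__(self, static_dim=2048, temporal_dim=1024, embed_dim=512):
        super().__init__()
        self.static_proj = nn.Linear(static_dim, embed_dim)
        self.temporal_proj = nn.Linear(temporal_dim, embed_dim)

    def forward(self, static_feats, temporal_feats):
        # static_feats:   (batch, n_frames, static_dim), e.g. 2D-CNN features
        # temporal_feats: (batch, n_clips,  temporal_dim), e.g. 3D-CNN features
        return (torch.relu(self.static_proj(static_feats)),
                torch.relu(self.temporal_proj(temporal_feats)))


class MultimodalFusion(nn.Module):
    """Captures high-level interaction between the two streams via
    cross-attention (one plausible form of 'feature fusion')."""

    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                batch_first=True)

    def forward(self, static_emb, temporal_emb):
        # Static embeddings attend to temporal embeddings; the attended
        # result is concatenated with the originals along the time axis.
        fused, _ = self.cross_attn(static_emb, temporal_emb, temporal_emb)
        return torch.cat([static_emb, fused], dim=1)


# Usage with random features standing in for CNN outputs.
embed = VisualSemanticEmbedding()
fuse = MultimodalFusion()
s = torch.randn(2, 20, 2048)     # 20 frames of static features
t = torch.randn(2, 8, 1024)      # 8 clips of temporal features
s_emb, t_emb = embed(s, t)
video_repr = fuse(s_emb, t_emb)  # (2, 40, 512)
```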
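Similarly, one plausible realization of the dependency controller is a learned sigmoid gate that mixes the attended visual context with the decoder's language state before each word prediction. The sketch below assumes this gating design; the DependencyController name, the dimensions, and the gate form are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn as nn


class DependencyController(nn.Module):
    """Gated mixing of visual context and language state before word
    prediction (an assumed design, not the paper's exact formulation)."""

    def __init__(self, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, 1)  # scalar gate per step
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_ctx, lang_state):
        # visual_ctx: (batch, hidden_dim) most relevant visual feature, e.g.
        #             produced by an attention-based visual decision module
        # lang_state: (batch, hidden_dim) decoder hidden state (language prior)
        g = torch.sigmoid(self.gate(torch.cat([visual_ctx, lang_state], dim=-1)))
        mixed = g * visual_ctx + (1.0 - g) * lang_state  # adaptive contribution
        return self.classifier(mixed)                    # vocabulary logits


# Usage for a single decoding step with random tensors.
ctrl = DependencyController()
v = torch.randn(2, 512)
h = torch.randn(2, 512)
logits = ctrl(v, h)  # (2, 10000)
```

Under this reading, a gate value near zero lets the language state dominate the next word, while a value near one forces the prediction to depend on the visual evidence, which is the intuition behind limiting reliance on language priors.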