| With the explosion of massive video data and the development of artificial intelligence technology, video captioning has become a hot research topic. It has a wide range of application prospects, such as sports video commentary, description of e-commerce products, and video title generation. The sequence-learning-based “Encoder-Decoder” structure, combined with attention mechanisms and attribute information, is widely used in the video captioning domain. However, these approaches suffer from two problems: first, the modeling of the video sequence is insufficiently expressive; second, they ignore the alignment between vision and language. To address these two problems, this paper proposes the following two methods: (1) A video captioning method based on multi-feature fusion and feature reconstruction. Temporal modeling of the video is carried out by fusing spatial features and motion features, which yields a discriminative visual representation. In addition, feature reconstruction is used to optimize the learning ability of the decoder so that it acquires a richer mapping from vision to language, thereby strengthening the semantic connection between them. (2) A video captioning method based on multi-modal feature representation and semantic guidance. Audio, visual, and other multi-modal features are fused to capture the content of the video and further enrich the expressiveness of the representation. In addition, a semantic information encoding module is designed to model the interactions between different visual entities in the video. Finally, a multi-modal attention mechanism is constructed to guide the decoder to select different features or semantic information at each decoding step, strengthening the correlation between vision and language. Extensive experiments on two large-scale datasets, MSVD and MSR-VTT, demonstrate that the two proposed approaches improve the performance of video captioning and generate high-quality sentences. |
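The first contribution combines multi-feature fusion with a feature-reconstruction objective. A minimal NumPy sketch of that idea is shown below; it is illustrative only, not the paper's implementation. All dimensions, the random stand-in features, and the linear reconstruction head `W_rec` are hypothetical assumptions, since the abstract does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 20 sampled frames, 2048-d spatial (2D-CNN) features
# and 1024-d motion (3D-CNN) features per frame. Real features would come
# from pretrained networks; random arrays stand in here.
T, D_SPATIAL, D_MOTION = 20, 2048, 1024
spatial = rng.standard_normal((T, D_SPATIAL))  # e.g. per-frame appearance
motion = rng.standard_normal((T, D_MOTION))    # e.g. per-clip motion

# Multi-feature fusion by concatenation along the channel axis,
# giving one joint visual representation per time step.
fused = np.concatenate([spatial, motion], axis=1)  # shape (T, 3072)

# Feature reconstruction: a (random, untrained) linear head maps decoder
# hidden states back into the fused visual space; the mean squared error
# between reconstruction and target is the auxiliary loss that pushes the
# decoder to retain visual information.
D_HIDDEN = 512
hidden = rng.standard_normal((T, D_HIDDEN))         # stand-in decoder states
W_rec = rng.standard_normal((D_HIDDEN, fused.shape[1])) * 0.01
reconstructed = hidden @ W_rec                      # (T, 3072)
rec_loss = float(np.mean((reconstructed - fused) ** 2))
print(fused.shape, rec_loss > 0.0)
```

In training, `rec_loss` would be added to the usual captioning cross-entropy so the decoder's hidden states are encouraged to stay predictive of the visual input.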
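The second contribution's multi-modal attention, which lets the decoder weigh visual, audio, and semantic information differently at each step, can be sketched as a scaled dot-product attention over modality-specific context vectors. This is a hedged illustration under assumed shapes; the modality names, dimension `D`, and scoring function are not taken from the abstract.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
D = 256  # assumed common embedding size for all modalities

# Hypothetical per-modality context vectors at one decoding step,
# all projected into the same D-dimensional space.
modalities = {
    "visual": rng.standard_normal(D),
    "audio": rng.standard_normal(D),
    "semantic": rng.standard_normal(D),  # entity-interaction encoding
}

def multimodal_attend(query, feats):
    """Score each modality against the decoder query and return a
    convex combination of the modality features for this step."""
    names = list(feats)
    scores = np.array([query @ feats[n] / np.sqrt(D) for n in names])
    weights = softmax(scores)
    context = sum(w * feats[n] for w, n in zip(weights, names))
    return context, dict(zip(names, weights))

query = rng.standard_normal(D)  # decoder hidden state at step t
context, weights = multimodal_attend(query, modalities)
print(context.shape, sum(weights.values()))
```

Because the weights are recomputed from the decoder state at every step, the model can lean on audio for one word and on the semantic entity encoding for the next, which is the "guidance" role described above.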