In recent years, with the rapid development of the video industry and the explosive growth of multimedia data on the Internet, video has gradually become one of the main media that users consume daily, so accurate retrieval of video content has become an important technical demand. The task of cross-modal video retrieval from textual descriptions has likewise received great attention for its own research value. At the same time, video-language pre-training helps a model learn common knowledge of video and text from large-scale data, which is conducive to improving cross-modal retrieval performance. However, existing work suffers from two problems. The first is that existing end-to-end video-language pre-training methods usually fail to mine multi-level information in videos, such as the local semantic information of video frames and the global video information. The second is that most existing cross-modal retrieval models start from image-text pre-trained models and then fine-tune on downstream tasks; however, an image-text pre-trained model captures only image information and does not explore the temporal and semantic relations between video frames. Based on the above analysis, this paper proposes an end-to-end multi-modal pre-trained model, HMMC. To solve the first problem, we propose a multi-level matching mechanism that fully mines the multi-level information in the video, compensating for the insufficient exploitation of such information when an image-text pre-trained model is used to extract video frame features in end-to-end training. The key idea is to explore the hierarchical information in a video through multi-level matching between the video and the text. This design is motivated by the observation that if a video is semantically matched to a text (which can be a title, tag, or caption), frames in that video often have a semantic connection to the text and show higher similarity to it than frames from other videos. Multi-level matching is mainly implemented through two proxy tasks: video-text matching and frame-text matching. Exploring the multi-level semantic connections between video and text enhances the model's video-language understanding ability. To address the second problem, we propose a frame adjacent matching pre-training task that fully mines the semantic information between video frames in a self-supervised manner. A temporal Transformer with positional encoding of the input video frames is used to perceive the order of the frames and fully mine the temporal information between them. In addition, a momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, so that HMMC can exploit more negative samples in end-to-end training. We also collect a large-scale Chinese video-language dataset (over 763K videos) called CHVTT, in which each video has a title and several tags. To the best of our knowledge, this is the first work to leverage video captions and tags to benefit video-language pre-training. Experimental results on two major text-video retrieval benchmarks, MSR-VTT and VATEX, in both Chinese and English, demonstrate the advantages of the proposed method.
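
The multi-level matching mechanism can be illustrated with a minimal sketch in which both proxy tasks are written as InfoNCE-style contrastive losses: a video-text loss over mean-pooled frame embeddings (global level) and a frame-text loss that treats every frame of video i as a positive for text i (local level). The function names, the pooling choice, the temperature, and the weighting factor alpha are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def video_text_matching_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between pooled video embeddings and text embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def frame_text_matching_loss(frame_emb, text_emb, temperature=0.07):
    """Each frame of video i is a positive for text i among the B texts in the batch."""
    B, T, D = frame_emb.shape
    f = F.normalize(frame_emb.reshape(B * T, D), dim=-1)  # (B*T, D) local frame level
    t = F.normalize(text_emb, dim=-1)                     # (B, D)
    logits = f @ t.t() / temperature                      # (B*T, B)
    targets = torch.arange(B, device=text_emb.device).repeat_interleave(T)
    return F.cross_entropy(logits, targets)

def multi_level_matching_loss(frame_emb, text_emb, alpha=0.5):
    """Combine global (video-text) and local (frame-text) matching objectives."""
    video_emb = frame_emb.mean(dim=1)                     # global video-level embedding
    return (alpha * video_text_matching_loss(video_emb, text_emb)
            + (1 - alpha) * frame_text_matching_loss(frame_emb, text_emb))
```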
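
The frame adjacent matching task admits one plausible reading as a binary prediction problem: frame features from the image-text pre-trained encoder are contextualized by a temporal Transformer with positional encodings, and a classification head predicts whether a sampled pair of frames is adjacent in the original video. The exact formulation in the paper may differ; every module, dimension, and hyper-parameter below is an illustrative assumption.

```python
import torch
import torch.nn as nn

class FrameAdjacentMatching(nn.Module):
    def __init__(self, dim=512, n_layers=4, n_heads=8, max_frames=32):
        super().__init__()
        # Learned positional encodings let the Transformer perceive frame order.
        self.pos_emb = nn.Parameter(torch.zeros(1, max_frames, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(2 * dim, 2)     # adjacent vs. not adjacent

    def forward(self, frame_emb, idx_a, idx_b):
        # frame_emb: (B, T, D) frame features; idx_a, idx_b: (B,) sampled frame indices.
        x = frame_emb + self.pos_emb[:, : frame_emb.size(1)]
        x = self.temporal(x)                                   # temporal context across frames
        b = torch.arange(x.size(0), device=x.device)
        pair = torch.cat([x[b, idx_a], x[b, idx_b]], dim=-1)   # (B, 2D) sampled frame pair
        return self.head(pair)                                 # logits over {adjacent, not}

# Training sketch: labels = ((idx_a - idx_b).abs() == 1).long(), optimized with cross-entropy.
```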
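
The benefit of the multimodal momentum contrast framework, namely access to more negatives than a single batch provides, can be sketched with a MoCo-style momentum-updated key encoder and a queue of past text keys. The encoder names, queue size, momentum coefficient, and FIFO update below are assumptions for illustration rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """Slowly move the key encoder's weights toward the query encoder's."""
    for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1 - m)

def contrast_with_queue(video_q, text_k, queue, temperature=0.07):
    # video_q: (B, D) from the query (video) encoder; text_k: (B, D) from the
    # momentum (text) encoder; queue: (K, D) of older text keys used as extra negatives.
    video_q = F.normalize(video_q, dim=-1)
    text_k = F.normalize(text_k, dim=-1)
    pos = (video_q * text_k).sum(dim=-1, keepdim=True)             # (B, 1) positive logits
    neg = video_q @ F.normalize(queue, dim=-1).t()                 # (B, K) queue negatives
    logits = torch.cat([pos, neg], dim=1) / temperature
    targets = torch.zeros(video_q.size(0), dtype=torch.long, device=video_q.device)
    loss = F.cross_entropy(logits, targets)
    new_queue = torch.cat([text_k.detach(), queue], dim=0)[: queue.size(0)]  # FIFO update
    return loss, new_queue
```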