| In daily life, surrounded by multimedia information, finding the desired knowledge in massive multimedia data has become increasingly difficult. Multimodal retrieval has emerged in response: it searches for items in massive databases using a text query provided by the user. Meanwhile, more fine-grained requirements have been proposed. For example, the cross-modal video moment retrieval task retrieves video clips from an untrimmed video, and it can be applied in scenarios such as movie highlights and character/background search. Current solutions to this task fall into two branches: ranking-based and localization-based methods. However, ranking-based methods are limited by their pre-segmented video moment candidates, while localization-based methods suffer from unstable convergence. Therefore, improving video content understanding and optimizing model execution efficiency are the keys to this task. Benefiting from the rapid development of cutting-edge deep learning technologies such as reinforcement learning, generative adversarial learning, and graph representation learning, multimodal content understanding has improved greatly, and the understanding of language and vision in particular has become rich and effective. This paper therefore starts from the existing shortcomings of the cross-modal video moment retrieval task and tries to address them in multiple aspects by combining these cutting-edge technologies. Specifically, the main research topics of this paper are as follows: (1) To address the lack of spatial awareness in localization-based methods, a video moment retrieval model that fuses spatiotemporal information is proposed. It uses temporal reinforcement learning to locate temporal boundaries and spatial reinforcement learning to track local scenes. In this way, rich spatial information is supplemented and semantically irrelevant, redundant background information is removed. (2) To address the instability and low efficiency of localization-based and ranking-based methods, a video moment retrieval model that fuses ranking and localization is proposed. It takes the localization-based method as the generator, which produces a reasonable number of video clips, and the ranking-based method as the discriminator, which feeds back a flexible reward score. Jointly training the two modules under the adversarial learning paradigm enhances multimodal content understanding and avoids some of the instability and inefficiency defects. (3) To address the limited ability of ranking-based methods to distinguish similar clips, a video moment retrieval model based on object interaction is proposed. It captures the interactions among objects in the video and the text by building a multimodal relational graph. In this way, irrelevant clips can be screened out in advance, and multimodal information can be fused at a fine-grained level to improve performance. Finally, extensive comparisons and analyses on two public datasets, Charades-STA and TACoS, show that the three solutions above enable the model to learn more appropriate multimodal information. Furthermore, various evaluation metrics improve in a more efficient and stable manner, so that the video clips returned to the user are more accurate. |