Video is an important carrier of information for recording and reflecting real life in the information society, and it contains rich semantic information. Using multimedia technology to localize specific semantic content in a video that contains multiple scenes and activities facilitates the use of video content and improves human-computer interaction. To localize the target segment containing a specific activity, traditional action localization methods rely on a set of pre-defined actions to localize action instances, and therefore cannot recognize instances outside this pre-defined set. In contrast, text and audio can flexibly represent a wide variety of activities and are not constrained by fixed activity categories, so using text or audio as a cross-modal query for video segment localization is more practical and has a wide range of potential applications. Text can flexibly describe complicated activities with rich semantic information and is suitable for localizing activity segments composed of various characters and actions, while audio carries the natural sound of activities, directly reflects their content, and is suitable for localizing segments with distinctive acoustic characteristics. Each modality has its own domain of application, and neither can replace the other. To accurately localize the target segment in a video given a text or audio query, an algorithm must process the information in the query and understand the video content; after extracting the feature representations of the query and the video, it must also perform elaborate cross-modal interaction and inference to align cross-modal semantics. Effective feature representation and cross-modal interaction are therefore the two keys to cross-modal video segment localization. Against this background, this paper conducts in-depth research on cross-modal video segment localization based on text or audio queries. The main research contents and innovations are as follows:

(1) A dual-path interaction method for video segment localization with text queries. To address the lack of segment discrimination in feature representations, this paper proposes a dual-path interaction method that uses two paths to encode a frame-level representation for boundary discrimination and a proposal-level representation for semantic alignment, respectively. Furthermore, the method takes the text query as a semantic condition to repeatedly guide the interaction between the representations of the two paths, enhancing their semantic consistency and improving the feature representation. Experimental results on three public datasets, i.e., TACoS, Charades-STA, and ActivityNet-Captions, show that this method significantly improves video segment localization performance.
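Purely as an illustration of the dual-path idea in (1), the following PyTorch-style sketch organizes a frame-level path and a proposal-level path whose representations repeatedly interact under the guidance of the text query. All module names, dimensions, and the attention-based interaction are assumptions made for illustration and do not reproduce the paper's exact architecture.

```python
# Minimal sketch (not the paper's exact architecture): a frame-level path and a
# proposal-level path interact repeatedly, conditioned on the text query.
import torch
import torch.nn as nn


class QueryGuidedInteraction(nn.Module):
    """One round of query-conditioned interaction between the two paths."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.frame_from_proposal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proposal_from_frame = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)  # the query acts as a semantic condition

    def forward(self, frames, proposals, query):
        # Modulate both paths with the sentence-level query feature.
        cond = torch.sigmoid(self.gate(query)).unsqueeze(1)      # (B, 1, D)
        frames, proposals = frames * cond, proposals * cond
        # Exchange information between the frame-level and proposal-level paths.
        f, _ = self.frame_from_proposal(frames, proposals, proposals)
        p, _ = self.proposal_from_frame(proposals, frames, frames)
        return frames + f, proposals + p


class DualPathLocalizer(nn.Module):
    def __init__(self, dim: int = 256, rounds: int = 2):
        super().__init__()
        self.rounds = nn.ModuleList([QueryGuidedInteraction(dim) for _ in range(rounds)])
        self.boundary_head = nn.Linear(dim, 2)   # per-frame start/end scores
        self.align_head = nn.Linear(dim, 1)      # per-proposal matching score

    def forward(self, frames, proposals, query):
        for block in self.rounds:
            frames, proposals = block(frames, proposals, query)
        return self.boundary_head(frames), self.align_head(proposals).squeeze(-1)


# Toy usage with random features: 64 frames, 16 candidate proposals, dim 256.
model = DualPathLocalizer()
boundary_scores, proposal_scores = model(
    torch.randn(2, 64, 256), torch.randn(2, 16, 256), torch.randn(2, 256)
)
print(boundary_scores.shape, proposal_scores.shape)  # (2, 64, 2) (2, 16)
```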
(2) A structured multi-level interaction method for video segment localization with text queries. To address the lack of fine-grained relationships during cross-modal interaction, this paper disentangles each segment into a content part and a boundary part according to its inherent structure and proposes a structured multi-level interaction method. The method interacts the whole segment with the whole query, and then interacts the content and boundary parts of the segment with different semantic parts of the text description, in a coarse-to-fine manner, to enhance the semantic alignment of the two modalities. Experimental results on the three above-mentioned public datasets show that this method significantly outperforms previous video segment localization methods.

(3) A multi-level context-aware method for video segment localization with audio queries. To address the lack of context understanding in feature representations, this paper proposes a multi-level context-aware method that encodes local and global context information for different features. In particular, the local context from adjacent parts provides local detail information, while the global context from the whole audio or video provides global semantic information. In addition, taking event segments and segment boundaries as event-internal and boundary context information enhances the consistency between different moments within an event and further improves the feature representation. Experiments on the AVE dataset show that this method significantly outperforms existing video segment localization methods.

(4) A semantic and relation modulation method for video segment localization with audio queries. To address the insufficient use of correlations during cross-modal interaction, this paper proposes a semantic and relation modulation method. The method uses intra-modal and cross-modal semantic information to modulate the audio and video features, and further uses the relations between different cross-modal moment combinations and between different candidate segments to modulate the cross-modal fused features, enhancing cross-modal semantic alignment. Experiments on the AVE dataset show that this method outperforms state-of-the-art video segment localization methods by clear margins.
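As an illustrative sketch of the modulation idea in (4), the code below shows one plausible way cross-modal semantic information could modulate audio and video features (a FiLM-style scale-and-shift), followed by a simple relation step over candidate segments. The specific operators, dimensions, and the FiLM-style formulation are assumptions and do not reproduce the paper's exact method.

```python
# Illustrative sketch of semantic and relation modulation (not the paper's
# exact method): cross-modal semantics modulate per-second audio/video
# features, then relations between candidate segments refine their scores.
import torch
import torch.nn as nn


class SemanticModulation(nn.Module):
    """Scale-and-shift one modality using the global semantics of the other."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, feats, other_modality):
        # Global semantic vector of the other modality (mean over time).
        sem = other_modality.mean(dim=1)                        # (B, D)
        scale = torch.sigmoid(self.to_scale(sem)).unsqueeze(1)  # (B, 1, D)
        shift = self.to_shift(sem).unsqueeze(1)                 # (B, 1, D)
        return feats * scale + shift


class RelationModulation(nn.Module):
    """Refine candidate-segment features using their pairwise relations."""

    def __init__(self, dim: int):
        super().__init__()
        self.relation = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, segment_feats):
        refined, _ = self.relation(segment_feats, segment_feats, segment_feats)
        return self.score(segment_feats + refined).squeeze(-1)  # (B, S)


# Toy usage: 10 one-second audio/video features, 5 candidate segments, dim 128.
audio = torch.randn(2, 10, 128)
video = torch.randn(2, 10, 128)
video_mod = SemanticModulation(128)(video, audio)  # video modulated by audio semantics
audio_mod = SemanticModulation(128)(audio, video)  # audio modulated by video semantics
segments = torch.randn(2, 5, 128)                  # fused features of candidate segments
scores = RelationModulation(128)(segments)
print(video_mod.shape, audio_mod.shape, scores.shape)  # (2, 10, 128) (2, 10, 128) (2, 5)
```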