We address audio-visual content-based event localization, which aims to automatically predict the starting and ending times of events occurring in video and audio and to recognize the categories of those events. This task provides technical support for a wide range of application scenarios such as intelligent security, medical aided diagnosis, and short-video entertainment. However, owing to the complex content, strong redundancy, and difficulty of understanding audio and video, the task faces the following challenges: 1) the temporal boundaries of events are difficult to localize accurately because event durations vary greatly; 2) accurate event recognition is hard to achieve because of changeable spatiotemporal content and strong background interference. Most existing methods focus on capturing unimodal temporal dependencies and performing effective audio-visual fusion for better content understanding, ignoring the negative performance impact of the above challenges. To address these challenges, we study the task from two aspects: improving localization accuracy and boosting event recognition. To improve localization accuracy, we propose a temporal boundary-aware algorithm. We recalibrate temporal features and produce multi-scale temporal feature sequences combined with cross-modal relationships, providing the model with a more comprehensive temporal context. Our method outperforms the state of the art on the AVVP dataset (e.g., 55.4% vs. 54%). To boost event recognition, we propose a cross-modal relation-aware algorithm. Inspired by humans' mechanism for understanding audio-visual content, we explore audio-guided visual attention and vision-guided audio attention to alleviate interference from visual backgrounds and to filter out irrelevant sounds. Moreover, we exploit intra- and inter-modal relationships for cross-modal information complementation, which facilitates the understanding of audio-visual content and boosts event recognition. Experiments demonstrate that our method significantly outperforms the state of the art on the AVE dataset (e.g., 73.6% vs. 70.2% in the weakly supervised setting), showing its effectiveness and superiority.
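
As a rough illustration of the audio-guided visual attention and vision-guided audio attention described above, the following PyTorch sketch pools visual regions under audio guidance and then reweights audio segments under visual guidance. The class name `AudioVisualCoAttention`, the tensor shapes (one-second segments, a 7x7 region grid), and the hidden size are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (illustrative assumptions, not the authors' exact model) of
# bidirectional audio-visual co-attention for event understanding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualCoAttention(nn.Module):
    def __init__(self, audio_dim: int = 128, visual_dim: int = 512, d: int = 256):
        super().__init__()
        # Project both modalities into a shared d-dimensional space.
        self.audio_proj = nn.Linear(audio_dim, d)
        self.visual_proj = nn.Linear(visual_dim, d)
        self.d = d

    def forward(self, audio, visual):
        # audio:  (B, T, audio_dim)      one feature per 1-second segment
        # visual: (B, T, R, visual_dim)  R spatial regions per segment
        a = self.audio_proj(audio)                        # (B, T, d)
        v = self.visual_proj(visual)                      # (B, T, R, d)

        # Audio-guided visual attention: score each region against the
        # concurrent audio feature, then pool regions into one visual vector,
        # suppressing background regions unrelated to the sound.
        region_scores = torch.einsum("btd,btrd->btr", a, v) / self.d ** 0.5
        region_attn = F.softmax(region_scores, dim=-1)    # (B, T, R)
        v_att = torch.einsum("btr,btrd->btd", region_attn, v)

        # Vision-guided audio attention: reweight audio segments by their
        # similarity to the attended visual features, filtering sounds that
        # are irrelevant to the visible content.
        seg_scores = torch.einsum("btd,bsd->bts", v_att, a) / self.d ** 0.5
        seg_attn = F.softmax(seg_scores, dim=-1)          # (B, T, T)
        a_att = torch.bmm(seg_attn, a)                    # (B, T, d)

        return v_att, a_att


if __name__ == "__main__":
    model = AudioVisualCoAttention()
    audio = torch.randn(2, 10, 128)       # 2 clips, 10 one-second segments
    visual = torch.randn(2, 10, 49, 512)  # 7x7 = 49 regions per segment
    v_att, a_att = model(audio, visual)
    print(v_att.shape, a_att.shape)       # (2, 10, 256) for both outputs
```

The attended features from both directions could then be fused and passed to a segment-level classifier for event localization; that downstream head is omitted here for brevity.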