Building on computer vision and speech recognition, audio-visual cross-modal learning exploits vision and sound jointly, so that information is learned across modalities and their complementarity is enhanced. It has been widely applied to cross-modal retrieval, video content review, intelligent surveillance, and human-computer interaction, and has received extensive attention from researchers. Audio-visual event localization and recognition requires spatio-temporally localizing and recognizing events in a video that carry both visual and audio information. However, owing to the severe structural heterogeneity and semantic gap between visual and audio data, it is difficult to establish semantic audio-visual associations, which significantly degrades event localization and recognition performance. Based on an investigation of existing methods, this dissertation identifies three key problems: feature matching for spatial localization, temporal modeling for temporal localization, and feature fusion for event recognition. Accordingly, it focuses on fusing similarity measures, selecting semantically consistent video segments, and enabling multi-layer interaction of multi-modal features in cross-modal learning, so as to improve audio-visual localization and recognition performance. The main contributions are as follows:

(1) A single similarity measure in audio-visual cross-modal learning matches audio and visual features pixel by pixel, which tends to produce an inaccurate spatial probability distribution and lowers the spatial localization accuracy of audio-visual events. To address this similarity ambiguity in feature matching, this dissertation proposes a distance fusion-based sound source localization network that spatially matches audio-visual features within a dual-stream framework. First, a convolutional recurrent neural network encodes the temporal information of the sound, and a grouped global average pooling operation is proposed to compress spectrogram features into sound vectors with a fixed number of time steps, so that sounds of arbitrary duration can be handled at inference time. Then, a cross-modal channel attention mechanism is proposed to map sound features into the visual feature space and strengthen the connection between the channel information of the two modalities. Finally, a distance fusion module is proposed to describe the similarity of audio-visual features more completely, effectively improving the accuracy of feature matching. Experiments on the widely used SoundNet-Flickr and FAIR-Play datasets yield a consensus intersection over union of 0.852 and an area under the curve of 0.623, and the correlation coefficient and similarity metric improve by 0.3% and 2.5%, respectively, over the best scores reported in recent years, demonstrating the effectiveness of the distance fusion strategy.
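To make the feature-matching pipeline in (1) concrete, the following minimal PyTorch sketch shows one plausible form of grouped global average pooling, cross-modal channel attention, and distance fusion. The module names, tensor shapes, and the specific pair of fused similarities (cosine plus a Gaussian of the Euclidean distance) are illustrative assumptions, not the dissertation's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedGAP(nn.Module):
    """Compress a variable-length spectrogram feature into a fixed number of
    time steps by grouped global average pooling along the time axis."""
    def __init__(self, time_steps=4):
        super().__init__()
        self.time_steps = time_steps

    def forward(self, x):                            # x: (B, C, T_var, F)
        x = x.mean(dim=-1)                           # pool frequency -> (B, C, T_var)
        x = F.adaptive_avg_pool1d(x, self.time_steps)  # group time -> (B, C, T)
        return x.flatten(1)                          # fixed-size sound vector (B, C*T)


class CrossModalChannelAttention(nn.Module):
    """Map the sound vector to channel weights that recalibrate the visual map."""
    def __init__(self, audio_dim, vis_channels):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(audio_dim, vis_channels), nn.Sigmoid())

    def forward(self, sound_vec, vis_map):           # vis_map: (B, C, H, W)
        w = self.fc(sound_vec)[:, :, None, None]     # per-channel gates derived from audio
        return vis_map * w


def distance_fusion_map(sound_vec, vis_map, proj):
    """Fuse two similarity measures (cosine and a Gaussian of the Euclidean
    distance) into one localization heat map; equal weighting is assumed."""
    B, C, H, W = vis_map.shape
    a = proj(sound_vec)                              # project audio into visual space (B, C)
    v = vis_map.flatten(2)                           # (B, C, H*W)
    cos = F.cosine_similarity(a[:, :, None], v, dim=1)                         # (B, H*W)
    euc = torch.exp(-torch.cdist(a[:, None, :], v.transpose(1, 2))).squeeze(1)  # (B, H*W)
    return (0.5 * (cos + euc)).view(B, H, W)


# Usage with random tensors standing in for encoder outputs:
spec_feat = torch.randn(2, 64, 37, 16)               # CRNN output (B, C, T_var, F)
sound_vec = GroupedGAP(4)(spec_feat)                 # (2, 256)
vis_map = CrossModalChannelAttention(256, 512)(sound_vec, torch.randn(2, 512, 14, 14))
heat = distance_fusion_map(sound_vec, vis_map, nn.Linear(256, 512))
print(heat.shape)                                    # torch.Size([2, 14, 14])

Combining a cosine term with a distance-based term in this way is one simple instance of distance fusion: each measure covers cases where the other is ambiguous, which is the intuition behind the proposed module.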
(2) The temporal boundaries of an audio-visual event within a video are unknown, and performing temporal modeling directly on all video segments introduces background noise and irrelevant visual targets, restricting the accuracy of temporal localization. To address this audio-visual inconsistency in temporal modeling, this dissertation proposes a consistent segment selection network that filters out background noise and irrelevant visual objects. First, a bidirectionally guided co-attention module learns latent audio-visual associations so as to attend simultaneously to sound-related visual regions and event-related parts of the sound. Then, to capture the global semantics of events, a context-aware similarity measurement module is proposed to select audio-visual segment pairs with high correlation scores. Finally, an audio-visual contrastive loss is proposed so that audio and visual features share similar semantic representations. Experiments on the benchmark AVE dataset achieve temporal localization accuracies of 80.5% and 76.8% in the supervised and weakly-supervised settings, respectively, and the audio-to-video and video-to-audio cross-modal retrieval accuracies improve by 2.7% and 3.5%, respectively, over the best scores reported in recent years, demonstrating that the proposed method selects semantically consistent video segments.

(3) The structural heterogeneity of audio and visual data leads to inconsistent data distributions and representations, making it difficult to obtain event semantics by effectively integrating audio-visual features and limiting further gains in audio-visual event recognition accuracy. To address this structural heterogeneity in multi-modal fusion, this dissertation proposes a local-to-global multi-modal interaction network that performs multi-layer fusion of RGB, optical flow, and sound features. For local multi-modal interaction, an inter-modal channel recalibration module is proposed to alleviate the heterogeneity between modalities, and an RGB modality aggregation module is proposed to make the model focus on the visual region where the event occurs, improving the discriminability of RGB features. For global multi-modal interaction, a unitary encoder, a parallel encoder, and a triplet encoder built on a weighted transformer structure are proposed to balance recognition accuracy and efficiency (a brief sketch of this global fusion is given at the end of this abstract); the parallel encoder suits static scenarios with low volume, while the triplet encoder suits dynamic scenarios with high volume. Experiments on the UCF101 subset, Kinetics-Sounds, and EPIC-Kitchens-55 datasets, using videos that contain audio tracks, show that the triplet encoder achieves the highest recognition accuracies of 96.05%, 88.37%, and 25.8%, respectively, while reducing computational complexity by 77.7%, 28.3%, and 83.4%, respectively, compared with the best scores reported in recent years. This confirms that the sound modality is complementary to visual event recognition and markedly improves recognition accuracy when visual targets have similar appearances.

In summary, the audio-visual event spatial localization, temporal localization, and recognition models proposed in this dissertation explore cross-modal learning between the audio and visual modalities and improve on existing methods in three respects: feature matching, temporal modeling, and feature fusion. Extensive experiments verify their feasibility.
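As a concrete illustration of the global interaction strategy in (3), the sketch below shows a weighted-transformer style triplet encoder that fuses RGB, optical-flow, and sound token sequences. The layer sizes, learnable modality weights, mean pooling, and class count are illustrative assumptions rather than the dissertation's exact design.

import torch
import torch.nn as nn


class TripletFusionEncoder(nn.Module):
    """Joint transformer encoding of RGB, flow and audio tokens with learnable
    per-modality weights (one plausible reading of a 'weighted transformer')."""
    def __init__(self, dim=256, heads=4, layers=2, num_classes=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.modality_weight = nn.Parameter(torch.ones(3))       # RGB, flow, audio
        self.modality_embed = nn.Parameter(torch.zeros(3, dim))  # modality type embeddings
        self.classifier = nn.Linear(dim, num_classes)            # num_classes: dataset-dependent placeholder

    def forward(self, rgb, flow, audio):             # each: (B, T_m, dim) token sequence
        w = torch.softmax(self.modality_weight, dim=0)
        tokens = torch.cat([
            w[0] * (rgb + self.modality_embed[0]),
            w[1] * (flow + self.modality_embed[1]),
            w[2] * (audio + self.modality_embed[2]),
        ], dim=1)                                    # (B, T_rgb + T_flow + T_audio, dim)
        fused = self.encoder(tokens).mean(dim=1)     # global pooling over all tokens
        return self.classifier(fused)                # event logits


# Usage with random features standing in for per-modality backbone outputs:
model = TripletFusionEncoder()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 8, 256), torch.randn(2, 4, 256))
print(logits.shape)                                  # torch.Size([2, 10])

Processing the three modalities in one shared encoder makes a triplet-style design heavier but more expressive than encoding each modality separately, which matches the reported trade-off between recognition accuracy and computational cost.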