In the real world, there are melodious piano sounds, refreshing flower fragrances, and beautiful scenery. Through a highly developed sensory system, people receive and analyze this rich multi-modal information flow and form the decisions that guide production and daily life. Over the past decade, with the continuous advancement of science and technology, intelligent machines have entered thousands of households. How to improve machines' comprehension ability and the friendliness of human-computer interaction through multi-modal information fusion has therefore become a research hotspot in artificial intelligence. Among these topics, audio-visual event recognition and sound source localization based on audio-visual fusion have broad application prospects in human-computer interaction, intelligent surveillance, and video analysis: they can substantially strengthen a machine's information processing pipeline and endow it with the intelligence to better serve people, promoting the construction of the intelligent society of the future.

In the field of deep learning, many audio-visual recognition and sound source localization algorithms based on the fusion of the auditory and visual modalities have emerged. However, how to extract and fuse sound features and image features more effectively, and how to decouple sound source information so as to localize the sounding object in an image more accurately, remain open challenges. In view of these difficulties, this paper conducts extensive research on audio-visual event recognition and sound source localization, and proposes, within a deep learning framework, an audio-visual event recognition model and a sound source localization model that fuse the audio and visual modalities. The main work and innovations of this paper are as follows:

1. For audio-visual event recognition, an audio-visual event recognition model based on spatial-channel feature fusion is proposed. The model takes full account of the different mechanisms by which convolutional neural networks extract spatial features and channel features. A CNN-based spatial-channel feature fusion module extracts and fuses the sound and image information, and the attention maps contained in this module effectively suppress interference from irrelevant information. In the secondary fusion stage, the fused audio-visual features and the sound features are processed by a Bi-LSTM and then combined by direct concatenation, a feature-level fusion strategy that keeps the model's complexity low. In the recognition and classification stage, event-background discrimination is performed first and the audio-visual event category is determined afterwards, which effectively improves the accuracy of audio-visual event recognition in videos. A sketch of such a fusion module is given below.
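The abstract does not spell out the module's internals, so the following is only a minimal PyTorch sketch, assuming a CBAM-style arrangement in which a channel attention map and a spatial attention map are applied in sequence to the fused audio-visual features. The class name, the element-wise-sum fusion, and all tensor shapes are illustrative assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """Hypothetical spatial-channel feature fusion block (a sketch, not
    the paper's design). Fuses an audio feature map and a visual feature
    map, then applies channel attention followed by spatial attention so
    that irrelevant information is suppressed by the attention maps.

    Assumes the audio feature has already been projected and tiled to the
    visual map's (B, C, H, W) shape.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite per channel.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor):
        # Fuse the two modalities by element-wise sum (an assumption).
        x = audio_feat + visual_feat
        b, c, _, _ = x.shape

        # Channel attention map, shape (B, C, 1, 1).
        avg = x.mean(dim=(2, 3))   # (B, C)
        mx = x.amax(dim=(2, 3))    # (B, C)
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1)

        # Spatial attention map, shape (B, 1, H, W).
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        sa = torch.sigmoid(self.spatial_conv(pooled))
        return x * sa
```

Applying channel attention before spatial attention is one common ordering; the actual module may weight or combine the two attention maps differently.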
2. For sound source localization, a multi-label contrastive learning model based on image segmentation is proposed. The model brings image instance segmentation and image classification methods into the sound source localization task. First, building on the commonalities between image instance segmentation and sound source localization, the instance segmentation network SOLO is introduced to extract and segment image features, and the instance masks belonging to the same category are merged. Second, inspired by the class activation mapping (CAM) heat maps used in image classification, a multi-label classification method is proposed to decouple the potential sound sources in an image, and a category mapping table is built to alleviate the overly fine prediction granularity caused by the large number of image categories. Finally, the response map between the sound features and the segmented image instances is weighted with the class activation map to obtain the final sound source location (a minimal sketch of this weighting step follows the summary of contributions below).

3. The proposed spatial-channel feature fusion model for audio-visual event recognition and the segmentation-based multi-label contrastive learning model for sound source localization are evaluated on the public datasets AVE and SoundNet-Flickr, respectively. Both the quantitative metrics and the qualitative heat maps show that the proposed algorithms compare favorably with current mainstream methods. In addition, multiple ablation studies verify the effectiveness of the proposed spatial-channel feature fusion module and the soundness of the sound source localization model's design.
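The abstract describes the final weighting only at a high level; the snippet below is a minimal PyTorch illustration of how an audio-visual response map could be weighted by merged instance masks and a class activation map. The function name, the cosine-similarity response, and all tensor shapes are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def localize_sound_source(audio_emb, visual_feat, instance_masks, cam):
    """Hypothetical final weighting step for sound source localization.

    audio_emb:      (D,)        audio embedding
    visual_feat:    (D, H, W)   visual feature map
    instance_masks: (H, W)      merged per-category instance masks in [0, 1]
    cam:            (H, W)      class activation map in [0, 1]
    Returns a (H, W) localization heat map in [0, 1].
    """
    d, h, w = visual_feat.shape

    # Cosine similarity between the audio vector and every spatial
    # location of the visual feature map: the audio-visual response map.
    v = F.normalize(visual_feat.reshape(d, -1), dim=0)  # (D, H*W)
    a = F.normalize(audio_emb, dim=0)                   # (D,)
    response = (a @ v).reshape(h, w)

    # Weight the response by the segmentation masks and the CAM, then
    # rescale to [0, 1] for visualization as a heat map.
    heat = response.clamp(min=0) * instance_masks * cam
    return heat / heat.max().clamp(min=1e-8)
```

The multiplicative combination is one plausible reading of "weighted"; a learned or additive combination of the three maps would also fit the description in the abstract.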