Computer vision and natural language processing are active research areas in artificial intelligence; their goal is to enable computers to simulate the human visual and auditory systems. Saliency detection is an important topic in computer vision, and video saliency detection in particular has received growing attention. Video saliency detection aims to automatically identify the most attractive object or region in a scene by simulating the human visual system, which helps people extract important information from massive data and allocate limited computing resources to the information that matters most. However, humans perceive sound and vision simultaneously, so saliency detection that combines audio and visual cues is more consistent with real scenes. Because audio and video differ sharply in how they are processed and in the features extracted from them, audio-visual saliency detection is extremely challenging.

Most existing video saliency detection methods focus on the visual modality alone; they typically use Long Short-Term Memory networks or optical flow to mine inter-frame features, but omit the audio modality. Current audio-visual saliency detection algorithms mainly adopt a two-stream structure to extract audio and visual features, and a simple fusion of the two serves as the final saliency prediction. However, the audio and visual information may be unrelated in some scenarios, and in that case direct fusion allows the audio information to have a negative impact on saliency detection.

To address these problems, this thesis proposes the following solutions:

1. We propose a saliency prediction network based on audio-visual combination, which consists of four parts: spatio-temporal visual feature extraction (V branch), audio feature extraction (A branch), an audio-visual fusion mechanism (A+V fusion), and saliency feature computation. The influence of audio information on visual saliency detection is fully considered, and a more efficient multi-modal fusion algorithm is proposed to improve detection accuracy.

2. We design an audio-visual consistency evaluation network, AVCE-Net. To counter the negative impact of ambient or background sound in a video, we add AVCE-Net to the audio-visual saliency detection model to promote audio-class-sensitive video saliency detection. AVCE-Net first makes a binary consistency judgment on the extracted audio and visual features: when they are consistent, the audio-visual fused features are output as the final prediction map; otherwise, the visual-dominant features are output as the final result (a minimal sketch of this gating step is given below). To train AVCE-Net, we annotate six public audio-visual datasets.

To verify the reliability of the proposed algorithm, experiments were conducted on the six publicly available datasets. Extensive experimental results show that considering audio information improves robustness, detection accuracy, and consistency with real scenes.
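
For concreteness, the following is a minimal PyTorch sketch of the consistency-gating step described in contribution 2. The module name AVCEGate, the 512-dimensional features, the two-layer judge network, and the linear fusion layer are all illustrative assumptions; the abstract does not specify the actual architecture or fusion mechanism.

```python
import torch
import torch.nn as nn


class AVCEGate(nn.Module):
    """Hypothetical consistency gate: judges whether the audio and visual
    features describe the same event, then selects between the A+V fused
    features and the visual-dominant features."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Binary consistency classifier over the concatenated feature pair.
        self.judge = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, 1),
        )
        # Placeholder A+V fusion; the thesis's actual fusion mechanism
        # is not specified in the abstract.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, v_feat: torch.Tensor, a_feat: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([v_feat, a_feat], dim=-1)
        consistent = torch.sigmoid(self.judge(pair)) > 0.5  # binary judgment
        fused = self.fuse(pair)
        # Consistent clips keep the audio-visual fusion; inconsistent clips
        # fall back to the visual-dominant features.
        return torch.where(consistent, fused, v_feat)


# Toy usage with random stand-ins for the V-branch and A-branch outputs.
v = torch.randn(4, 512)  # spatio-temporal visual features (batch of 4)
a = torch.randn(4, 512)  # audio features
print(AVCEGate()(v, a).shape)  # torch.Size([4, 512])
```

In practice, the binary judgment would be supervised with the consistency annotations on the six audio-visual datasets mentioned above.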