Human attention is affected not only by visual stimuli but also by audio signals. Existing video saliency detection algorithms use only visual signals as input and rarely consider audio signals, whose semantically rich auditory information is essential to saliency. It is therefore of great significance to make full use of the semantic information in audio signals to assist saliency detection based on visual information. Audio classification networks trained on different datasets generally focus on different types of semantic information and thus produce different audio features; that is, different data sources generally yield different audio features. Therefore, to ensure the completeness of the extracted audio features, audio networks should be trained on several datasets so as to improve their generalization ability. In the dual-stream network structure used by audio-visual saliency detection methods, the audio and visual signals influence each other, which is beneficial when the signals are relevant but harmful when they are not. In audio-visual saliency detection, video signals play a more important role than audio signals. When the audio and visual signals are inconsistent, the audio has a negative impact on the visual stream of the dual-stream network, weakening the visual features of the object and degrading the saliency prediction. Preserving visual information and performing feature enhancement after audio-visual fusion helps solve this problem. Finally, it is also crucial to align audio and video information and make them interact fully. Traditional fusion methods ignore the relative importance of feature attributes and fail to fuse them effectively; an attention mechanism should therefore be adopted for feature fusion to achieve effective information interaction. In view of the above problems, this thesis proposes two solutions:

1. A multi-stream audio-visual saliency detection algorithm based on co-attention. Traditional audio-visual saliency detection algorithms take one audio stream and one video stream as input; however, for the same data, audio networks pre-trained on different datasets often produce different audio features, because each network extracts different types of semantic information from the audio signal. On top of the audio-visual model, this method integrates an additional audio network that is trained on a large dataset and capable of recognizing more audio signals. First, audio networks trained on different datasets are selected, which ensures the completeness and accuracy of audio information extraction. In addition, traditional audio-visual fusion harms information extraction and the learning of common features, which leads to degraded feature fusion. To address this issue, the method adopts a co-attention mechanism for audio-visual fusion and learns the correlation between the two modalities to ensure their consistency.

2. A multi-stream audio-visual saliency detection algorithm based on visual information compensation. In traditional multi-stream audio-visual saliency methods, when the audio and visual signals are inconsistent, irrelevant audio information weakens the visual information and reduces its effect. First, a video encoding branch that preserves the complete object appearance and motion information in the video signal is introduced to compensate for the weakening of the visual features by the audio and to enhance their saliency. Second, since the fusion strategy determines the effect of information compensation, the method uses a feature fusion strategy to integrate the video encoding features with the audio-visual saliency features, which strengthens the expression of visual information and achieves visual information compensation.

Theoretical analysis and experimental results show that the proposed methods outperform the baseline methods on audio-visual datasets and achieve promising saliency detection performance. Both the additional audio branch and the video encoding network play important roles in improving the final detection performance. The fusion strategy enables different types of features to interact effectively, which raises the weight of useful features and avoids the information loss and feature weakening caused by asynchrony between audio and video.
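The co-attention fusion used in the first method can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the thesis implementation: the function name, feature shapes, and the simple dot-product affinity are all illustrative. The idea is that an affinity matrix between audio and visual features is normalized in both directions, and each modality is enhanced with a context summary of the other before fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention_fuse(audio, visual):
    """Toy co-attention fusion (illustrative sketch).

    audio:  (Ta, d) audio feature sequence
    visual: (Tv, d) visual feature sequence
    Returns fused features of shape (Ta, 2d) and (Tv, 2d).
    """
    affinity = audio @ visual.T            # (Ta, Tv) cross-modal affinity
    a2v = softmax(affinity, axis=1)        # each audio step attends over visual steps
    v2a = softmax(affinity.T, axis=1)      # each visual step attends over audio steps
    audio_ctx = a2v @ visual               # visual context summary per audio step
    visual_ctx = v2a @ audio               # audio context summary per visual step
    # Concatenate each modality with the context gathered from the other.
    fused_audio = np.concatenate([audio, audio_ctx], axis=-1)
    fused_visual = np.concatenate([visual, visual_ctx], axis=-1)
    return fused_audio, fused_visual

# Example: 4 audio steps and 6 visual steps with 8-dimensional features.
rng = np.random.default_rng(0)
fa, fv = co_attention_fuse(rng.standard_normal((4, 8)),
                           rng.standard_normal((6, 8)))
print(fa.shape, fv.shape)  # (4, 16) (6, 16)
```

In a trained model the affinity would typically be computed through learned projections rather than a raw dot product, but the bidirectional normalization shown here is what lets each modality weight the other's features by relevance.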