Multi-modal salient object detection commonly uses visible-light (RGB) images as the primary modality and depth (D) or thermal infrared (T) images as auxiliary modalities to jointly detect salient objects. Because T images are only weakly affected by illumination conditions, they provide necessary complementary information for RGB images, so predicting salient objects from dual-modal images is of great significance for applications such as autonomous driving and intelligent surveillance. This dissertation takes RGB and T images as the research objects. To address the missed detections caused by the weak representation ability of existing methods, it studies dual-modal feature representation and applies feature enhancement to solve this problem. The main research contents are as follows:

(1) To address the incomplete object boundary contours detected by the model, this dissertation proposes a detection method based on multi-level feature fusion and attention enhancement. By fusing the dual-modal multi-level encoded features layer by layer from low level to high level, the method generates global context features that carry both low-level spatial detail and high-level semantic information; it uses the original T images to enhance the encoded RGB features to attenuate background noise, and explicitly constructs the shared features of the two modalities to refine the object boundary contour. Comparison experiments, ablation experiments, and visualization results show that the method suppresses the interference of background noise and clearly highlights the object, making its boundary contour more complete.

(2) To address inaccurate localization of salient objects by the model, this dissertation proposes a detection method based on global feature enhancement. The method enhances the global features by mining cross-scale information from the dual-modal fusion features, so that they provide more accurate semantic and location information, and it continuously updates the global features by mining and fusing the multi-scale dual-modal features during decoding, enabling the model to locate and detect objects more accurately. Comparison experiments, ablation experiments, and visualization results show that the method accurately locates salient objects even in complex scenes and improves the confidence of the prediction results.

(3) To address the internal incompleteness of the objects detected by the model, this dissertation proposes a detection method based on dual-stage dual-modal feature enhancement. The method uses a self-attention structure in the encoding stage, relying on its advantage in modeling long-range correlations between image patches, to extract dual-modal multi-level encoded features that contain more effective information, and it uses a dual-modal fusion structure in the decoding stage so that the dual-modal features complement each other. Comparison experiments, ablation experiments, and visualization results show that the method fully exploits the complementary advantages of the two modalities and detects salient objects accurately and completely even when the internal features of a salient object are missing or invalid in one modality.

In conclusion, by enhancing various types of features and fully exploring and exploiting the correlation and complementarity of the two modalities, this dissertation effectively alleviates the problem of missed salient object detection in scenes ranging from simple to complex. Experimental results on the VT821, VT1000, and VT5000 datasets demonstrate the effectiveness and reliability of the proposed methods.
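The core mechanism shared by these methods, using the T modality to reweight RGB features and fusing features across levels, can be illustrated with a minimal, purely schematic sketch. This is not the dissertation's implementation; all function names are hypothetical, and real models would operate on tensors with learned convolutional attention rather than flat lists of floats.

```python
# Illustrative sketch only: a toy, pure-Python rendering of the
# T-guided attention enhancement and low-to-high fusion ideas.
# Names (t_guided_enhancement, fuse_levels) are hypothetical,
# not taken from the dissertation.
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def t_guided_enhancement(rgb_feat, t_feat):
    """Elementwise attention: T features gate the RGB features.

    rgb_feat, t_feat: equal-length lists of floats (flattened feature maps).
    Returns the RGB features enhanced by a T-derived attention map,
    with a residual connection so the original signal is preserved.
    """
    assert len(rgb_feat) == len(t_feat)
    attn = [sigmoid(t) for t in t_feat]  # attention weights in (0, 1)
    return [r + r * a for r, a in zip(rgb_feat, attn)]


def fuse_levels(low, high):
    """Low-to-high fusion: high-level semantics combined with low-level
    detail. Here both maps have the same length, so fusion reduces to an
    elementwise sum."""
    return [l + h for l, h in zip(low, high)]


# Toy example: a strong T response (large positive values) amplifies the
# object region, while a weak/negative T response leaves background
# positions nearly unchanged, attenuating noise relative to the object.
rgb = [0.2, 0.9, 0.8, 0.1]
t = [-4.0, 3.0, 2.5, -5.0]
enhanced = t_guided_enhancement(rgb, t)
```

In this toy setting, positions where the thermal response is strong are boosted while background positions stay close to their original values, which mirrors the described effect of suppressing background interference and highlighting the object.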