| Salient object detection locates the most visually attractive object in a scene. With the rapid development of deep learning, its performance has improved greatly and it is now widely used in various fields. Multi-modal salient object detection strengthens a model's features by learning from different forms of images. RGB-T salient object detection takes an RGB image and a thermal image as input, uses thermal information to supplement the RGB image, and performs salient object detection with multi-modal features. Because thermal images are insensitive to lighting conditions, they are effective in a variety of challenging scenes. However, there is still room for improvement in the dataset scale and network structure design of current RGB-T saliency detection models. In this thesis, an RGB-T dataset is built, and dual-branch and single-branch network structures are studied in turn. The main work is as follows: (1) Compared with other computer vision fields, the datasets for RGB-T salient object detection are smaller, and their scenes lack diversity and domain specificity. To solve this problem, this thesis builds VT723, a dataset suited to RGB-T salient object detection. The dataset contains 723 sets of corresponding RGB images, thermal images and saliency labels, all depicting vehicle driving scenes. VT723 provides more images for training and evaluation and expands the application of salient object detection. (2) In current dual-branch network structures, a single label is usually used to supervise both branches, which makes it difficult to mine the differences between modalities. To solve this problem, this thesis designs an RGB-T salient object detection model based on a mirror complementary network. The saliency labels are divided into skeleton labels and contour labels according to the Euclidean distance, and these are used to supervise the RGB images and thermal images in the two branches
respectively, making full use of the distinct strengths of each modality. At the same time, a dual-branch complementary module is designed to extract attention from the information of each modality and add it to the other branch, achieving low-level feature fusion. In addition, a serial multiscale dilated convolution module is designed to expand the receptive field and achieve high-level feature fusion. Experiments on public benchmark datasets and VT723 show that the proposed method outperforms state-of-the-art approaches on different evaluation metrics. (3) In current single-branch network structures, the multi-modal images are usually simply concatenated, and the correlation information between modalities is not extracted by the large-scale convolution operations. To solve this problem, this thesis designs an RGB-T salient object detection model based on multi-dimensional spatial information. This model introduces spatial-dimension features into the single-branch network: it arranges the images of the two modalities along the spatial dimension, feeds them into a 3D convolution network for feature extraction, and fully learns the correlation information between modalities. At the same time, a level-by-level collaborative weighting module is designed, which weights the location information and channel information of features at different scales layer by layer and focuses on important features in the 3D convolutions. Finally, a mixed loss function is designed to provide a more comprehensive loss calculation and make the model converge efficiently and accurately. Experiments on public benchmark datasets and VT723 show that the proposed method outperforms most approaches on different evaluation metrics. |
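
The abstract states that saliency labels are split into skeleton and contour labels according to the Euclidean distance, but it does not give the exact rule. The following is a minimal sketch of one plausible decomposition, not the thesis's confirmed method. It assumes the skeleton label is the normalized Euclidean distance from each salient pixel to the nearest background pixel, and the contour label is the remainder of the mask. The function name `split_saliency_label` and the normalization are illustrative assumptions.

```python
import numpy as np


def split_saliency_label(mask):
    """Split a binary saliency mask into (skeleton, contour) soft labels.

    Assumption (not from the thesis): skeleton = normalized Euclidean
    distance to the nearest background pixel, contour = mask - skeleton.
    """
    mask = np.asarray(mask, dtype=bool)
    fg = np.argwhere(mask)    # coordinates of salient pixels
    bg = np.argwhere(~mask)   # coordinates of background pixels
    mask_f = mask.astype(np.float64)
    if len(fg) == 0 or len(bg) == 0:
        # Degenerate mask: there is no boundary to measure distance from.
        return mask_f, np.zeros_like(mask_f)

    # Brute-force distance from every salient pixel to every background
    # pixel, keeping the minimum. Only suitable for small masks.
    diff = fg[:, None, :] - bg[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

    skeleton = np.zeros_like(mask_f)
    skeleton[mask] = dist / dist.max()   # values in (0, 1], highest at the core
    contour = mask_f - skeleton          # largest near the object boundary
    return skeleton, contour


if __name__ == "__main__":
    demo = np.zeros((5, 5), dtype=int)
    demo[1:4, 1:4] = 1                   # a 3x3 salient square
    skeleton, contour = split_saliency_label(demo)
    print(skeleton)
    print(contour)
```

The brute-force distance computation keeps the sketch dependency-light. For full-resolution masks, `scipy.ndimage.distance_transform_edt` computes the same distances far more efficiently. By construction the two labels sum back to the original mask, so each branch is supervised on a complementary part of the object.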