Salient object detection (SOD) aims to simulate the human visual system by detecting and segmenting the most visually attractive objects and regions in an image. Salient objects serve as an important preprocessing step in many computer vision tasks. Because depth maps contain spatial structure that is unavailable in RGB images, and because the development of depth sensors has made depth maps increasingly easy to obtain, RGB-D SOD has attracted growing research interest. Although many remarkable results have been achieved, several issues still urgently need to be addressed. This paper proposes two RGB-D salient object detection models to tackle these problems.

(1) Most existing RGB-D SOD models use depth maps and RGB images to complement each other for salient object detection. However, RGB-D SOD datasets contain numerous low-quality depth maps, which negatively affect the final saliency prediction. To deal with this problem, a novel three-stream complementary network is proposed. First, high-quality depth maps are chosen as training targets for an RGB stream, with the corresponding RGB images used as inputs; the trained RGB stream can then generate an estimated depth map for each RGB image. Second, a feature complementation fusion module is constructed to merge features from the original depth stream, the estimated depth stream, and the RGB stream. Finally, a top-down decoder with a large receptive field is designed to decode salient features from different stages and predict saliency maps. Experiments on 4 benchmark datasets against 7 state-of-the-art models demonstrate that the model achieves a clear improvement.

(2) Most existing RGB-D SOD models rely on heavy backbones such as VGG and ResNet, which lead to large model sizes and high computational costs. To address this problem, a lightweight two-stage decoder network is proposed. First, MobileNetV2 and a customized backbone are used to extract features from RGB images and depth maps, respectively. To mine and combine cross-modality information, a cross reference module fuses complementary information from the two modalities. Subsequently, a feature enhancement module, consisting of four parallel convolutions with different dilation rates, is designed to enhance the cues in the fused features. Finally, a two-stage decoder predicts the saliency maps, processing high-level and low-level features separately before merging them. Experiments on 5 benchmark datasets against 10 state-of-the-art models demonstrate that the model achieves a clear improvement with the smallest model size.
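The feature enhancement module described above relies on parallel dilated convolutions, whose different dilation rates give each branch a different receptive field over the fused features. A minimal sketch of this effect, using the standard effective-kernel-size formula (the specific 3x3 kernel and the rates 1, 2, 4, 8 are illustrative assumptions, not values taken from the thesis):

```python
# Effective kernel size of a dilated convolution:
#   k_eff = d * (k - 1) + 1
# where k is the base kernel size and d the dilation rate.
def effective_kernel(k: int, d: int) -> int:
    return d * (k - 1) + 1

# Hypothetical dilation rates for four parallel 3x3 branches.
rates = [1, 2, 4, 8]
fields = {d: effective_kernel(3, d) for d in rates}
print(fields)  # {1: 3, 2: 5, 4: 9, 8: 17}
```

Because each branch covers a different spatial extent, concatenating (or summing) the branch outputs lets the module aggregate context at several scales at once without increasing the kernel's parameter count.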