Monocular depth prediction, which aims to recover per-pixel depth from a single 2D image, is a fundamental yet critical problem in computer vision and pattern recognition. It is widely used in many applications, including autonomous driving, robotics, 2D-to-3D conversion, and mobile entertainment. However, an image taken by an ordinary camera is a 2D projection of the real world and therefore loses the depth information of the scene. Although depth sensors can capture depth data directly, their applicability is limited by drawbacks such as a short depth perception range and the low resolution and sparseness of the captured depth. Predicting depth from monocular images has therefore drawn increasing attention from both academia and industry. Since the dominant approach to monocular depth estimation is supervised learning, the large number of outliers and the insufficient sample diversity in sensor-collected depth data seriously degrade the accuracy and generalization of trained models. To address these issues, this dissertation makes the following contributions.

First, a robust regression model for monocular depth prediction is proposed to handle the large number of outliers in depth data. Specifically, by filtering out outliers at each iteration and computing the loss only over the remaining valid pixels, the adverse effect of outliers on model training is eliminated. Meanwhile, an encoder-decoder network based on multi-scale feature fusion is designed to output structured depth predictions. To address the issue that pixel-wise metrics cannot accurately measure the accuracy of depth structure, this dissertation introduces a structure-aware objective metric. Experimental results on both simulated noisy data and real data verify the effectiveness of the proposed method.

Second, because depth data lack diversity and dense depth labels are costly to obtain, this dissertation proposes a method to generate dense depth labels automatically. To this end, optical flow estimation, semantic segmentation, and several post-processing techniques are combined to generate dense relative depth maps from web stereo images. Building on this, the dissertation presents a new dataset consisting of diverse RGB images and dense relative depth maps. Since the scale and shift factors of these data are unknown, a pair-wise ranking loss is proposed for model supervision. Experimental results show that the proposed method, which is capable of depth perception in unconstrained scenes, outperforms other state-of-the-art methods.

Third, existing state-of-the-art methods are prone to predicting globally inconsistent depth maps with blurry depth boundaries and missing depth structures. To tackle these issues, this dissertation proposes a structure-guided ranking loss. In particular, the sampling is guided by low-level edge maps and high-level object instance masks so that it better characterizes the structure of important regions. Experimental results show that the pair-wise ranking loss, combined with the structure-guided sampling strategies, significantly improves the quality of depth prediction.

Fourth, to further improve the accuracy and generalization of the model, this dissertation optimizes the proposed data generation method. Specifically, accurate sky segmentation masks are extracted to obtain precise zero-disparity regions, and a left-right consistency check is introduced to provide confidence maps for the generated depth. A dense depth dataset with higher quality and richer diversity is then constructed. In experiments, cross-dataset evaluations on six benchmark datasets show that the proposed method achieves superior quantitative and qualitative results.
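The outlier-filtering idea behind the first contribution can be illustrated with a minimal sketch: at each training iteration, the pixels with the largest residuals are treated as outliers and excluded from the loss. The PyTorch setting, the function name, and the trimming ratio `trim` are illustrative assumptions, not the dissertation's exact formulation.

```python
import torch

def trimmed_depth_loss(pred, target, valid_mask, trim=0.2):
    """L1 loss over the (1 - trim) fraction of valid pixels with the
    smallest residuals; the largest residuals are treated as outliers."""
    residual = (pred - target).abs()[valid_mask]        # residuals of labeled pixels only
    k = max(1, int((1.0 - trim) * residual.numel()))    # number of pixels to keep
    kept, _ = torch.topk(residual, k, largest=False)    # drop the largest residuals
    return kept.mean()
```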
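For the second contribution, a pair-wise ranking loss supervises only the ordinal relation between sampled point pairs, which is what makes training possible when scale and shift are unknown. The sketch below follows the standard formulation from the relative-depth literature; whether it matches the dissertation's exact variant is an assumption.

```python
import torch

def pairwise_ranking_loss(pred, idx_a, idx_b, ordinal):
    """pred: (N,) predicted depths at sampled pixels; idx_a, idx_b: (M,)
    pair indices; ordinal: (M,) labels, +1 if point a is farther than
    point b, -1 if closer, 0 if roughly at the same depth."""
    diff = pred[idx_a] - pred[idx_b]
    ordered = ordinal != 0
    # Ordered pairs: logistic loss pushing the sign of diff to match the label.
    loss_order = torch.log1p(torch.exp(-ordinal[ordered] * diff[ordered]))
    # Equal pairs: squared loss pulling the two predictions together.
    loss_equal = diff[~ordered] ** 2
    return torch.cat([loss_order, loss_equal]).mean()
```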
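The structure-guided sampling of the third contribution can likewise be sketched as drawing point pairs that straddle image edges, so that the ranking loss concentrates on candidate depth boundaries. The fixed pixel offset and the binary edge map are illustrative assumptions; the dissertation additionally guides sampling with object instance masks, which is omitted here.

```python
import numpy as np

def sample_edge_pairs(edge_map, num_pairs, offset=6, seed=0):
    """Sample point pairs on opposite sides of edge pixels."""
    rng = np.random.default_rng(seed)
    h, w = edge_map.shape
    ys, xs = np.nonzero(edge_map)                  # candidate edge pixels
    pick = rng.choice(len(ys), size=num_pairs)     # random edge locations
    ys, xs = ys[pick], xs[pick]
    # Step to opposite sides of the edge along a random direction.
    theta = rng.uniform(0.0, 2.0 * np.pi, size=num_pairs)
    dy = np.rint(np.sin(theta) * offset).astype(int)
    dx = np.rint(np.cos(theta) * offset).astype(int)
    ya, xa = np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)
    yb, xb = np.clip(ys - dy, 0, h - 1), np.clip(xs - dx, 0, w - 1)
    return (ya, xa), (yb, xb)
```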
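A left-right consistency check of the kind used in the fourth contribution can be sketched as warping the right-view disparity into the left view and zeroing the confidence of pixels where the two disparities disagree. The tolerance `tol` and the function name are illustrative, not the dissertation's exact procedure.

```python
import numpy as np

def lr_consistency_confidence(disp_left, disp_right, tol=1.0):
    """Return a binary confidence map for the left disparity."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    # For each left pixel at column x, its match in the right image is x - d_left.
    x_right = np.clip(np.rint(xs - disp_left).astype(int), 0, w - 1)
    disp_right_warped = np.take_along_axis(disp_right, x_right, axis=1)
    return (np.abs(disp_left - disp_right_warped) <= tol).astype(np.float32)
```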
Finally, this dissertation applies monocular depth prediction to shallow depth-of-field rendering. To this end, a shallow depth-of-field rendering system is designed based on salient object detection and depth prediction. More specifically, shallow depth-of-field rendering is divided into two steps: focus determination and depth-of-field rendering. To find the focus automatically, a salient object detection method based on a label-guided contrastive loss is proposed. In addition, a physically motivated method termed scatter-to-gather is designed to keep the refocused plane sharp and to enable smooth transitions around depth discontinuities. Experimental results on both synthetic and real data verify the effectiveness of the proposed method.
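The abstract does not spell out the scatter-to-gather formulation, but the underlying gather idea can be sketched: each output pixel collects color from every neighbor whose circle of confusion covers it, which keeps in-focus pixels sharp while defocused regions spread out. The linear circle-of-confusion model `k * |depth - focus_depth|`, the parameters, and the naive O(H·W·r²) loop are all illustrative assumptions.

```python
import numpy as np

def gather_refocus(image, depth, focus_depth, max_radius=8, k=20.0):
    """Naive gather pass: pixel p averages every neighbor q whose
    circle of confusion, k * |depth(q) - focus_depth|, covers p."""
    h, w, _ = image.shape
    coc = np.minimum(k * np.abs(depth - focus_depth), max_radius)
    out = np.zeros_like(image, dtype=np.float64)
    weight = np.zeros((h, w, 1))
    for dy in range(-max_radius, max_radius + 1):
        for dx in range(-max_radius, max_radius + 1):
            src = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
            src_coc = np.roll(np.roll(coc, dy, axis=0), dx, axis=1)
            hit = (np.hypot(dy, dx) <= src_coc)[..., None]  # q's CoC covers p?
            out += src * hit
            weight += hit
    return out / np.maximum(weight, 1e-6)
```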