
Deep Learning-based Depth Estimation Method For Outdoor Scenes

Posted on: 2024-08-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Y Lu
GTID: 1528307124994539
Subject: Control Science and Engineering

Abstract/Summary:
Monocular Depth Estimation (MDE) is a fundamental problem in the field of machine vision and 3D perception. It is the process of converting 2D images into 3D scenes, with the aim of perceiving and understanding real 3D scenes. Monocular depth estimation for outdoor scenes has the widest range of applications and has made great progress in research directions such as robot navigation, autonomous driving, virtual reality and augmented reality. Driven by the rapid development of deep learning in the field of vision, low-cost vision devices are gradually replacing or assisting traditional ranging devices for scene depth estimation, which to a certain extent overcomes their high cost and complex calibration. The macroscopic world is a four-dimensional manifold consisting of a 3D Euclidean space and a one-dimensional time axis. Machine vision perceives the world mainly through cameras that acquire two-dimensional images or sequences of continuous images in the visible frequency band, but an image only characterizes the limited information of the 3D world projected onto a two-dimensional plane at a given moment; representing a scene with two-dimensional images therefore lacks temporal calibration and a representation of how near or far scene objects are from the camera. Accurate depth estimation of the 3D scene helps to recover accurate 3D structure and clarify the positional relationships between objects, which helps the machine to perceive the real world and assists semantic understanding in making proper decisions.

In terms of supervised and self-supervised learning, this paper analyses five key problems of existing depth estimation methods and conducts further research on them: (1) the information transfer problem of network models; (2) the homogenization problem of feature extraction for complex outdoor scenes; (3) the low-reliability geometric constraint problem of inter-frame supervision mechanisms in outdoor scenes; (4) the uncertainty problem of the motion characteristics of dynamic objects in outdoor scenes; and (5) the problem of self-supervision with intra-frame prior information in specific outdoor scenes. The main work of this paper is summarized below.

To address the lack of theoretical grounding and the low interpretability of deep learning, as well as the problems of unreliable geometric constraints and under-utilization of supervised information in stereo vision, a theoretical analysis and experimental validation of deep learning-based depth estimation methods is conducted for outdoor scenes. The main contributions include:

(1) A generative adversarial depth estimation method based on convolutional spatial propagation networks is proposed. Despite the variety of efficient network structures in depth estimation algorithms, numerous problems remain. To address transmission loss and imbalanced data distribution in the network, the paper proposes a generative adversarial network based on a correlation discriminative loss. By constructing a densely connected structure to increase the efficiency of the information transfer path, information loss in the network can be significantly reduced. For the unbalanced data distribution, the generative adversarial mechanism establishes an independent depth discriminator to judge the correctness of depth maps, which alleviates the data imbalance issue. For the depth map edge blur problem, the high-order convolutional spatial propagation network is modified to perform self-iterations with the affinity matrix to reconstruct accurate edge gradients. On the KITTI depth dataset, the proposed method reached 0.0720 for Abs Rel, 0.3250 for Sq Rel, 2.7020 for RMS and 0.1160 for Log RMS.
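The abstract does not spell out the propagation step, but the idea of self-iteration with an affinity matrix in a convolutional spatial propagation network can be pictured with a short sketch. Everything below (the helper name cspn_refine, the 3x3 neighbourhood, and the way the affinity weights are normalized) is an assumption for illustration, not the dissertation's actual implementation.

import numpy as np

def cspn_refine(depth, affinity, iterations=3):
    """Affinity-guided spatial propagation (illustrative sketch only): each
    pixel's depth is repeatedly replaced by a weighted sum of its 3x3
    neighbourhood, with neighbour weights taken from a learned affinity map
    and the centre weight chosen so that all weights sum to one.

    depth:    (H, W) initial depth map
    affinity: (8, H, W) raw affinity values for the 8 neighbours of each pixel
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    h, w = depth.shape
    refined = depth.astype(np.float64).copy()
    norm = np.abs(affinity).sum(axis=0) + 1e-8
    weights = affinity / norm            # normalised neighbour weights
    centre = 1.0 - weights.sum(axis=0)   # residual weight kept by the pixel itself
    for _ in range(iterations):
        padded = np.pad(refined, 1, mode="edge")
        neigh = np.stack([padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                          for dy, dx in offsets])
        refined = centre * refined + (weights * neigh).sum(axis=0)
    return refined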
(2) A robust frequency pyramid network for depth estimation is proposed. Even with large-scale labelled outdoor depth datasets, depth estimation models still struggle to achieve robustness comparable to biological vision systems. Images of real scenes are subject to heavy noise, such as Gaussian noise, fog, motion blur and overexposure. To address the degradation induced by noise, the paper constructs a frequency pyramid network that fuses spectral information from multiple frequency bands. The method is a supervised depth estimation model based on frequency-domain division. To improve the network's capability to adapt to multi-band inputs, it employs a pyramid structure to facilitate model fusion and achieves multi-scale feature fusion through a spatial attention residual refinement module. Secondly, to fuse high-frequency and low-frequency information, the proposed spatial attention residual refinement module is employed not only to extract features from the colour domain, but also to recover detailed information from multi-level frequency bands. Finally, to validate the robustness of the model in high-noise environments, generic additive-noise depth estimation datasets are constructed by simulating various noise properties. On the KITTI depth dataset, the proposed method reached 0.0690 for Abs Rel, 0.3020 for Sq Rel, 2.6520 for RMS and 0.1120 for Log RMS.

(3) Self-supervised monocular depth estimation via multiple bilateral consistency is proposed. Inter-frame-supervised depth estimation is a challenging task with two main problems. The first is that very few pixels are successfully matched between adjacent frames. The second is that the geometric relationships established in reprojection matching are not reliable. To address these problems, this paper imposes multiple bilateral consistency constraints on the inter-frame supervision method. Firstly, for pixels that are not utilized in adjacent-frame matching, the method re-renders the depth map as an RGB image through a re-rendering network, thus establishing a cycle-consistent framework together with the depth estimation network. Secondly, to establish reliable inter-frame constraints, pose consistency and depth consistency are proposed: the pose-consistency constraint aims to ensure the reversibility of ego-motion transformations between adjacent frames, and the depth-consistency constraint aims to ensure the continuity of depth across adjacent frames. On the KITTI depth dataset, the proposed method, pre-trained on the Cityscapes urban scenes dataset, reached 0.0990 for Abs Rel, 0.7180 for Sq Rel, 4.4080 for RMS and 0.1790 for Log RMS.
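The abstract states the pose-consistency and depth-consistency constraints only verbally. A minimal sketch of how such terms could be written is given below, assuming 4x4 homogeneous ego-motion matrices T_fwd (frame t to t+1) and T_bwd (frame t+1 to t) and a depth map of frame t+1 already warped into frame t's view; both loss forms and all names are illustrative assumptions rather than the thesis's exact formulation.

import numpy as np

def pose_consistency_loss(T_fwd, T_bwd):
    """Illustrative pose-consistency term: the forward ego-motion (frame t
    to t+1) composed with the backward estimate (frame t+1 to t) should be
    close to the identity, so the deviation of their product is penalised."""
    residual = T_bwd @ T_fwd - np.eye(4)
    return np.abs(residual).mean()

def depth_consistency_loss(depth_t, depth_t1_warped, eps=1e-8):
    """Illustrative depth-consistency term: the depth predicted for frame t
    and the depth of frame t+1 warped into frame t's view should agree; a
    symmetric relative difference keeps the penalty bounded."""
    diff = np.abs(depth_t - depth_t1_warped)
    return (diff / (depth_t + depth_t1_warped + eps)).mean()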
(4) Joint self-supervised depth and optical flow estimation for dynamic objects is proposed. In outdoor scenes, the original inter-frame supervised depth estimation method suffers from multiple crucial problems. One is that the original one-way inter-frame constraint has low interpretability and robustness; the other is that it does not account for dynamic objects. To address these issues, this work introduces optical flow information into the inter-frame supervised depth estimation method, jointly estimating the depth map and the optical flow map to predict the objects' relative motion and accurate depth in the scene. By estimating and describing the motion properties of dynamic objects through optical flow networks, this chapter constructs a joint self-supervised depth and optical flow estimation framework for outdoor scenes, optimizing the optical flow estimation, depth estimation and pose estimation networks simultaneously by constraining photometric reprojection errors and optical flow reconstruction errors. In the optical-flow-based motion region segmentation, the initially estimated optical flow map is adaptively segmented by identifying non-connected regions. For the inter-frame supervised depth estimation, the depths of the different motion regions are estimated independently and then composed into the complete depth map. Furthermore, the camera pose matrix and depth map are re-synthesized into optical flow maps, and the reconstruction error is computed against the initially estimated optical flow maps. On the KITTI depth dataset, the proposed method, pre-trained on the Cityscapes urban scenes dataset, reached 0.0940 for Abs Rel, 0.6030 for Sq Rel, 3.8920 for RMS and 0.1640 for Log RMS.

(5) Self-supervised monocular depth estimation on water scenes via a specular reflection prior is proposed. Besides inter-frame supervision, extensive reliable depth information exists within frames in the form of specular reflections in water reflection scenes. For such scenes, the depth estimation task can be reformulated as a perspective matching problem between the real scene and its virtual reflection. This work presents the first deep-learning approach to depth estimation on water scenes via specular reflection priors. Firstly, to match reflections with their source patterns, a photometric adaptive SSIM, developed from SSIM, is introduced to focus on local contrast and structure. Secondly, to construct a general framework that is easy to re-implement, lightweight backbones are employed for water segmentation and depth estimation. To address the absence of a dataset of reflection scenes, a large-scale specular reflection dataset, the Water Reflection Scene dataset, is created with Unreal Engine 4. On this water reflection scene dataset, the proposed method reached 0.1360 for Abs Rel, 0.9990 for Sq Rel and 5.0100 for RMS.
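The exact form of the photometric adaptive SSIM is not given in the abstract. As a rough sketch of the kind of measure it builds on, the snippet below computes only the contrast and structure terms of standard SSIM over a pair of patches; the function name, the choice of constant and the vertical-flip matching convention are assumptions for illustration.

import numpy as np

def contrast_structure_similarity(x, y, c2=0.03 ** 2):
    """SSIM-style score that keeps only the contrast and structure terms;
    the luminance term is dropped because a water reflection is usually
    darker than the real pattern it mirrors. x, y: grayscale patches in [0, 1]."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return (2.0 * cov_xy + c2) / (var_x + var_y + c2)

# A candidate reflection patch would be compared against the vertically
# mirrored source patch, since the virtual image on the water surface is a
# flipped copy of the scene, e.g.:
# score = contrast_structure_similarity(reflection_patch, np.flipud(source_patch))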
Keywords/Search Tags:Outdoor scene depth estimation, Self-supervised learning, Robust depth estimation, Inter-frame-supervised depth estimation