Depth estimation plays a crucial role in environmental perception and understanding. It is not only a frontier direction of computer vision, computer graphics, and robotics, but is also widely applied in intelligent driving, intelligent medical care, and virtual reality; it has therefore received broad attention from both academia and industry. The core task of depth estimation is to recover the depth value of each point in a scene image and reconstruct the depth map. According to the acquisition device, scene images can be divided into monocular, binocular, and multi-view images. Compared with binocular and multi-view images, monocular images are more widely available in real life and demand deeper analysis. Monocular depth estimation is also more challenging, because monocular images convey less scene information and the problem is inherently ill-posed. Monocular depth estimation is therefore not only a requirement of practical applications, but also a theoretical hotspot and a recognized difficulty in the field of depth estimation. Research on monocular depth estimation usually focuses on three key technologies: depth feature encoding, depth feature decoding, and depth regression.

This paper surveys and analyzes existing monocular depth estimation algorithms based on deep neural networks, and identifies the following unresolved problems and their causes:

(1) Depth blur and boundary artifacts: existing methods usually treat moving objects as static, or model them separately with masks, and lack joint learning of the polymorphic information in the scene; as a result, the estimated depth maps still suffer from depth blur and boundary artifacts.

(2) Scene spatial structure drift: existing methods often attend only to the depth information of the current view and ignore the temporal correlation between adjacent views. This causes mis-matching and mis-mapping of corresponding feature points across adjacent views, so spatial structure drift persists in the recovered depth maps.

(3) Loss of scene detail information: existing methods often design deeper or more complex continuous regression networks to predict monocular depth maps, in which repeated convolution, pooling, and deconvolution operations discard rich detail features. This weakens the scene geometry, leading to missing local scene information and incomplete boundary information in the recovered depth maps.

To address these problems, this paper studies monocular depth estimation from a new perspective and proposes new networks and algorithms for depth feature encoding, depth feature decoding, and depth regression, achieving breakthroughs on depth blur and boundary artifacts, scene spatial structure drift, and the loss of scene detail information. The main research work and innovations are as follows:

(1) For the problem of object depth blur and boundary artifacts, a joint feature encoding method based on polymorphic information (PINet) is proposed, which adopts a multi-information joint learning strategy to encode the polymorphic (static and dynamic) information of the scene simultaneously. To enhance the correlation and consistency among the polymorphic information during joint encoding, a motion consistency constraint function and a pose consistency constraint function are designed, which improve the prediction accuracy of scene depth and effectively alleviate object depth blur and boundary artifacts. Experiments on the KITTI and TUM datasets show that PINet is feasible and has the best overall performance on the depth estimation task among recent methods, improving the accuracy δ < 1.25 by 1.6% and 5.3%, respectively, over the best recent methods (the evaluation metrics quoted here and below are sketched after this paragraph).
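As context for the quantitative claims above and below: δ < 1.25 denotes the standard threshold-accuracy metric and RMSE the root mean squared error, as commonly used in the monocular depth estimation literature; the abstract itself does not define them, so the following is a minimal NumPy sketch of the usual definitions:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid pixels (depths > 0):
    threshold accuracy (delta < 1.25) and root mean squared error."""
    ratio = np.maximum(pred / gt, gt / pred)    # per-pixel worst-case ratio
    delta1 = np.mean(ratio < 1.25)              # fraction of "close enough" pixels
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    return delta1, rmse

# Toy example: a prediction within +/-10% of the ground truth
gt = np.random.uniform(1.0, 80.0, size=(375, 1242))    # e.g. a KITTI depth range
pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)
print(depth_metrics(pred, gt))                          # delta1 = 1.0, small RMSE
```

Higher δ < 1.25 and lower RMSE are better, which is why the results below are reported as accuracy gains and error reductions.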
(2) For the problem of scene spatial structure drift, a monocular depth feature decoding method based on spatial and temporal attention (ST-Depth) is proposed. To enhance the learning and mapping of scene spatial structure and temporal correlation, a self-attention strategy based on fused features is designed, which adaptively selects and enhances the mapping and expression of spatial features in the current view and strengthens the global consistency of local features; a mutual-attention strategy based on corresponding features is designed to select and enhance the correlation and mapping of temporal features between adjacent views, strengthening the long-term dependence of corresponding features and effectively alleviating scene spatial structure drift (see the attention sketch after item (3)). Experiments on the KITTI and NYU Depth V2 datasets show that ST-Depth achieves its design goal, reducing the RMSE by 3.2% and 0.6%, respectively, compared with the best recent methods. In addition, experiments on the Make3D dataset show that ST-Depth has strong transfer and generalization ability.

(3) For the problem of the loss of scene detail information, an ordinal regression network with weighted inference for monocular depth estimation (WI-ORNet) is proposed. To recover detailed and complete monocular depth maps, an incremental discretization strategy is introduced, and hierarchical fusion, attention enhancement, and residual optimization modules are designed to strengthen the description of scene details. A weighted inference function that combines the predicted probabilities of the depth labels is proposed to reduce the erroneous expression of detail features and the depth-inference error during depth regression. Experiments on the KITTI and NYU Depth V2 datasets show that WI-ORNet is effective: the accuracy δ < 1.25 on the [0 m, 50 m] range of KITTI improves by 2.5%, the performance on the [0 m, 80 m] range of KITTI is comparable to the best recent methods, and δ < 1.25 on NYU Depth V2 improves by 7.0%. In addition, the convergence speed and inference time of the depth estimation model are also competitive.
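For the spatial and temporal attention decoding described in item (2), the abstract gives no architectural details. The following is a minimal sketch under the assumption that self-attention runs within the current view's fused features and mutual attention queries the adjacent view, built from standard multi-head attention; the class name `SpatioTemporalAttention` and all shapes are hypothetical, not the thesis's actual design:

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Hypothetical decoder block: self-attention within the current view,
    then cross (mutual) attention to the adjacent view's features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur, adj):
        # cur, adj: (B, N, dim) flattened feature maps of the current / adjacent view
        cur = cur + self.self_attn(cur, cur, cur)[0]   # global consistency of local features
        cur = cur + self.cross_attn(cur, adj, adj)[0]  # temporal correspondence across views
        return cur

block = SpatioTemporalAttention(dim=64)
cur, adj = torch.randn(1, 300, 64), torch.randn(1, 300, 64)
print(block(cur, adj).shape)  # torch.Size([1, 300, 64])
```

The residual connections keep the original spatial features available, so attention re-weights and augments them rather than replacing them.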
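For the weighted inference in item (3), a common reading of "incremental discretization" plus "weighted inference over depth-label probabilities" is spacing-increasing depth bins combined with a probability-weighted (soft) depth estimate, as in DORN-style ordinal regression. Whether WI-ORNet follows exactly this scheme is not stated in the abstract, so treat the function names and bin choices below as illustrative:

```python
import numpy as np

def sid_bins(d_min, d_max, k):
    """Spacing-increasing discretization: bin edges grow log-uniformly,
    so nearby depths get finer bins than distant ones."""
    i = np.arange(k + 1)
    return np.exp(np.log(d_min) + np.log(d_max / d_min) * i / k)

def weighted_inference(bin_probs, edges):
    """Soft depth estimate: probability-weighted average of bin centers,
    instead of committing to the single most likely ordinal label."""
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    return np.sum(bin_probs * centers, axis=-1)

edges = sid_bins(1.0, 80.0, k=80)               # e.g. the KITTI [0 m, 80 m] setting
probs = np.random.dirichlet(np.ones(80))        # one pixel's depth-label probabilities
print(weighted_inference(probs, edges))         # a depth inside (1, 80) meters
```

Averaging over label probabilities smooths out hard quantization error at object boundaries, which is consistent with the stated goal of reducing the erroneous expression of detail features during depth regression.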