Monocular depth estimation is a challenging research topic in computer vision. Estimating scene geometry from single images of real scenes improves the perception and understanding of 3D scenes and plays an important role in many vision tasks such as scene reconstruction and viewpoint synthesis. Compared with deep convolutional networks, self-attention networks can establish long-range dependencies and rapidly obtain a global receptive field, which benefits the quality of monocular depth estimation. To this end, the following three lines of research are carried out:

(1) The multi-head self-attention of the Transformer captures the intra- and inter-correlations among image tokens insufficiently, so the self-attention maps have low feature granularity, which is unfavorable for predicting inter-object scale and the depth of tiny objects. To address this, a self-attention network architecture optimized by token attention is proposed. Token attention redistributes feature weights while strengthening the connections among token features, which enhances self-attention to tiny objects and improves depth-estimation accuracy. To improve generalization, the network is trained for affine-invariant depth on a mixture of depth datasets. Numerical experiments show that both the quantitative and qualitative results of the proposed algorithm improve significantly.

(2) When a self-attention monocular depth estimation network is applied to videos or continuous image sequences, the depth is temporally inconsistent between some adjacent frames: the depth of the same object varies and flickers from frame to frame. To address this, an optical-flow-based motion-constraint loss is proposed. During training, it uses the optical flow between adjacent frames to warp the image, computes the non-occluded region, and minimizes the depth error in that region with a more robust smooth-L1 penalty. By adding this explicit motion constraint, the algorithm effectively improves the temporal consistency of the depth predicted for continuous images. Numerical experiments show further improvement in both quantitative and qualitative results over the algorithm of the previous chapter, and generalization experiments and validation are also performed on real scenarios.

(3) To meet the demand for efficient and accurate video depth in practical engineering applications, a real-time video depth estimation application is built on a cloud-based framework with visual interaction, providing cross-platform, high-quality depth-image generation for images and videos. The application is designed and implemented with modules for uploading, depth estimation, image display and control, and storage and download. The core depth-estimation algorithm uses the depth estimation network from Chapter 4 and is optimized with the TensorRT accelerated-inference framework, achieving an engineering validation of temporally consistent monocular video depth estimation while maintaining prediction speed.
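The token attention of contribution (1) is described only at a high level above. As a minimal sketch of one plausible form, the block below learns a scalar weight per token and uses it to redistribute token features; the module name, MLP scorer, and residual combination are all my assumptions for illustration, not the thesis's actual design.

```python
import torch
import torch.nn as nn

class TokenAttention(nn.Module):
    """Hypothetical token-attention block (illustrative only): learns a
    per-token weight and re-scales token features with it, so that
    informative tokens (e.g. tiny objects) contribute more strongly."""

    def __init__(self, dim):
        super().__init__()
        # small MLP producing one attention score per token
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, tokens):
        # tokens: (B, N, C) — batch, number of tokens, channel dim
        w = torch.sigmoid(self.score(tokens))  # (B, N, 1) per-token weight
        # weighted redistribution with a residual path, preserving shape
        return tokens + w * tokens
```

Such a block would typically sit after (or inside) a multi-head self-attention layer, leaving the token sequence shape unchanged so the rest of the Transformer is unaffected.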
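The motion-constraint loss of contribution (2) can be sketched as follows. This is my reconstruction under stated assumptions, not the thesis's exact implementation: depths of two adjacent frames are aligned by backward-warping with the forward optical flow, occluded pixels are masked out via forward-backward flow consistency (the threshold `eps` is an assumed hyperparameter), and a smooth-L1 penalty is applied in the non-occluded region.

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp x (B,C,H,W) with a pixel-unit flow field (B,2,H,W)."""
    B, _, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(x.device)   # (2,H,W)
    coords = grid.unsqueeze(0) + flow                          # sample points
    # normalise coordinates to [-1, 1] as required by grid_sample
    cx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    cy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid_n = torch.stack((cx, cy), dim=-1)                     # (B,H,W,2)
    return F.grid_sample(x, grid_n, align_corners=True)

def motion_constraint_loss(depth_prev, depth_curr, flow_fwd, flow_bwd, eps=1.0):
    """Smooth-L1 depth error restricted to the non-occluded region.

    depth_*: (B,1,H,W); flow_fwd: prev->curr; flow_bwd: curr->prev.
    """
    # forward-backward consistency: where the round trip returns close to
    # the starting pixel, the pixel is treated as non-occluded
    flow_bwd_warped = warp(flow_bwd, flow_fwd)
    fb_err = (flow_fwd + flow_bwd_warped).norm(dim=1, keepdim=True)
    valid = (fb_err < eps).float()

    # bring the current-frame depth into the previous frame's view
    depth_curr_warped = warp(depth_curr, flow_fwd)
    loss = F.smooth_l1_loss(depth_prev * valid, depth_curr_warped * valid,
                            reduction="sum") / valid.sum().clamp(min=1.0)
    return loss
```

For a static scene with perfect flow, the warped depths coincide and the loss vanishes; in training, this term is added to the usual depth-supervision loss to penalize frame-to-frame flicker explicitly.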