Monocular depth estimation is a crucial area of research in computer vision. Its aim is to produce pixel-level depth maps from single-view RGB images. This depth information supports a better understanding of 3D scenes and has a wide range of applications in areas such as scene reconstruction, autonomous driving, and robot navigation. Computer vision has long been dominated by deep convolutional neural networks, but in recent years Transformer-based network architectures have shown superior performance on several vision tasks. In this paper, we investigate the applicability of the Swin Transformer, a hierarchical derivative of the Transformer, to monocular depth estimation, and propose an improved network architecture. In pixel-level tasks such as monocular depth estimation, where high-resolution images serve as input, the computational cost and complexity of the model are critical in determining whether it can be trained with both high efficiency and high accuracy. Some previous work has proposed monocular depth estimation architectures based on the Vision Transformer; however, the computational cost and complexity of these models are large, making them not fully suitable for dense prediction tasks. For supervised learning, we use a hierarchical Transformer, the Swin Transformer, as the feature-extraction encoder, together with an adaptable decoder, built on a spatial resampling module and RefineNet, that accommodates the different encoder variants. To verify the effectiveness of this network structure, we conduct experiments on the public NYU Depth V2 dataset for monocular depth estimation. The results show that the encoder-decoder structure proposed in this paper, fine-tuned on the dataset, yields substantial improvements on the dense prediction task of monocular depth estimation and achieves better depth estimation results than the Transformer-based DPT-Hybrid model. In addition, we propose a Grad-CAM-based visual evaluation method that visualises and analyses the proposed depth estimation model layer by layer to dissect its usability step by step.

In recent years, natural language processing and computer vision have increasingly overlapped in their basic models, learning algorithms, and multimodal applications. Masked image modelling (MIM) is a form of masked signal prediction: it masks a portion of the input image and trains a deep network to predict the masked content conditioned on the visible pixels, enabling representation learning in an unsupervised manner. In this paper, we propose SimMIM-based monocular depth estimation, transferring the parameters of a Swin Transformer model pre-trained with SimMIM on the ImageNet dataset to initialise the weights of our encoder. The experimental results show that SimMIM pre-training with the Swin Transformer backbone improves the fine-tuning results for monocular depth estimation, and at the same time alleviates, to a certain extent, the data-hunger problem caused by growing model capacity.
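The efficiency argument above rests on the Swin Transformer restricting self-attention to non-overlapping local windows, which makes the cost linear in image size rather than quadratic. A minimal NumPy sketch of that window-partitioning step (the function name, window size, and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def window_partition(x, win):
    # x: (H, W, C) feature map; split it into non-overlapping win x win windows,
    # within each of which self-attention would be computed independently
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    # reorder so each window's pixels are contiguous: (num_windows, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

feat = np.arange(8 * 8 * 3).reshape(8, 8, 3).astype(np.float32)
windows = window_partition(feat, 4)
print(windows.shape)  # (4, 4, 4, 3)
```

Attention over each 4x4 window touches 16 tokens regardless of image size, which is what keeps the architecture tractable for the high-resolution inputs of dense prediction.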
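Grad-CAM, the basis of the layer-by-layer evaluation mentioned above, weights a layer's feature maps by the spatial average of the gradients flowing into them. A framework-agnostic sketch with NumPy arrays standing in for the activations and gradients one would extract from the network (the layer choice and array shapes are illustrative):

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations: (C, H, W) feature maps of the chosen layer
    # gradients:   (C, H, W) gradients of the model output w.r.t. those maps
    weights = gradients.mean(axis=(1, 2))        # global-average-pool each gradient map
    cam = np.einsum("c,chw->hw", weights, activations)
    cam = np.maximum(cam, 0.0)                   # ReLU keeps positively contributing regions
    if cam.max() > 0:
        cam = cam / cam.max()                    # normalise to [0, 1] for overlay on the input
    return cam

rng = np.random.default_rng(0)
cam = grad_cam(rng.random((64, 7, 7)), rng.random((64, 7, 7)))
```

Upsampling the resulting low-resolution heat map to the input size and overlaying it on the RGB image shows which regions each layer relies on when predicting depth.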
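The SimMIM recipe summarised above is conceptually simple: mask random patches, let the network predict the raw pixels, and penalise the prediction only on the masked region. A self-contained NumPy sketch of the masking and loss (the patch size, mask ratio, and zero fill are illustrative; SimMIM itself uses a learnable mask token and an L1 loss on masked pixels):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(img, patch=4, mask_ratio=0.5):
    # img: (H, W) image; mask a random subset of non-overlapping patch x patch blocks
    H, W = img.shape
    ph, pw = H // patch, W // patch
    chosen = rng.permutation(ph * pw)[: int(ph * pw * mask_ratio)]
    patch_mask = np.zeros(ph * pw, dtype=bool)
    patch_mask[chosen] = True
    # upsample the patch-level mask to pixel resolution
    pixel_mask = np.repeat(np.repeat(patch_mask.reshape(ph, pw), patch, 0), patch, 1)
    masked = np.where(pixel_mask, 0.0, img)      # 0.0 stands in for the learnable mask token
    return masked, pixel_mask

img = rng.random((16, 16))
masked_img, mask = mask_patches(img)
pred = np.zeros_like(img)                        # stand-in for the network's reconstruction
l1_loss = np.abs(pred - img)[mask].mean()        # loss is computed on masked pixels only
```

Because the pretext task needs no labels, the encoder can be pre-trained on large unlabelled image collections before fine-tuning on the comparatively small NYU Depth V2 dataset, which is how the pre-training eases the data-hunger of larger models.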