As one of the most challenging classical tasks in computer vision, depth estimation aims to predict the depth of objects in two-dimensional scene images. Depth estimation algorithms provide prior information for 3D reconstruction and are widely used in fields such as robotics and autonomous driving. With the rapid development of deep learning, depth estimation has achieved remarkable predictive results. Compared with supervised depth estimation networks, self-supervised depth estimation is not limited by costly label acquisition: the network exploits the temporal consistency between image sequences, and the depth map is continuously optimized as an intermediate product of 3D reprojection. In this paper, we analyze existing self-supervised depth estimation models based on monocular image sequences and optimize the depth estimation network by incorporating semantic information. We then design a scene information processing system and a 3D reconstruction system based on the depth estimation algorithm, further exploring applications of depth estimation. Our main contributions are threefold.

(1) We propose a depth estimation network based on multi-scale feature extraction and fusion. Common depth estimation networks use downsampling to expand the receptive field and extract high-level semantic information, but this loses important spatial details. We redesign the encoder to run multi-scale convolutional streams and a multi-scale feature fusion module in parallel feature subspaces while preserving feature resolution. The multi-scale features are then sent to a feature enhancement module based on the self-attention mechanism, which learns correlations between non-local regions, strengthens the fusion of the multi-scale features, and enriches spatial and semantic information (a minimal sketch of such a block is given below). We train and test the model on a road-driving dataset; the experimental results show improved quantitative accuracy and qualitative results with sharper edges and small objects.

(2) We present a self-supervised depth estimation model that incorporates semantic information. Research and analysis show that the semantic information in a scene is strongly correlated with depth information. Accordingly, we use a multi-task network with a shared encoder and a multi-branch decoder to predict depth maps and semantic masks simultaneously. Between the depth estimation branch and the semantic segmentation branch, we embed a cross-task interaction module to enhance the flow of information between depth features and semantic features. First, the output of each decoding layer undergoes channel and spatial context extraction, with adaptive feature refinement for each task. The refined representations are then sent to a cross-task multi-layer module that exploits category information and enriches the gradient information in the depth map (an illustrative sketch is also given below). In addition, the reconstruction process and the label consistency of semantic masks between adjacent frames are constrained to guide the optimization of the depth network. Finally, we evaluate the semantic-aware self-supervised depth estimation algorithm on standard metrics for both depth estimation and semantic segmentation to verify its effectiveness.
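
To make the encoder design in contribution (1) concrete, the following is a minimal PyTorch sketch of a resolution-preserving multi-scale block followed by a self-attention enhancement step. The module name, dilation rates, and layer sizes are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class MultiScaleFusionBlock(nn.Module):
    """Sketch: parallel multi-scale convolutions at full resolution,
    fused and then enhanced by self-attention over spatial positions.
    All sizes and names are assumptions for illustration."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        # one 3x3 branch per dilation rate; spatial resolution is preserved
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, 1)
        # single-head self-attention capturing non-local correlations
        self.attn = nn.MultiheadAttention(embed_dim=out_ch, num_heads=1,
                                          batch_first=True)

    def forward(self, x):
        fused = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)        # (B, H*W, C)
        enhanced, _ = self.attn(tokens, tokens, tokens)  # non-local mixing
        enhanced = enhanced.transpose(1, 2).reshape(b, c, h, w)
        return fused + enhanced                          # residual enhancement
```

The dilated branches trade extra computation for a larger receptive field without downsampling, which is the trade-off the contribution targets; full-resolution self-attention would of course need tiling or pooling for large inputs.
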
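For contribution (2), the sketch below illustrates one way a cross-task interaction module at a single decoder level could refine depth and semantic features with channel and spatial attention and then exchange them. The wiring and names are assumptions; the thesis module may differ in detail.

```python
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    """Channel then spatial context extraction for one task branch (sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid()
        )
        self.spatial = nn.Sequential(nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)      # channel context
        return x * self.spatial(x)   # spatial context

class CrossTaskInteraction(nn.Module):
    """Sketch: each branch is refined and then receives the other task's
    refined representation through a residual fusion convolution."""
    def __init__(self, ch):
        super().__init__()
        self.depth_refine = TaskAttention(ch)
        self.sem_refine = TaskAttention(ch)
        self.to_depth = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.to_sem = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, f_depth, f_sem):
        d = self.depth_refine(f_depth)
        s = self.sem_refine(f_sem)
        # category cues flow into the depth branch, and vice versa
        f_depth = f_depth + self.to_depth(torch.cat([d, s], dim=1))
        f_sem = f_sem + self.to_sem(torch.cat([s, d], dim=1))
        return f_depth, f_sem
```
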
(3) We design and implement a scene processing system based on the self-supervised depth estimation model. Depth estimation makes it possible to recover three-dimensional spatial structure from two-dimensional images and has practical application value. In existing 3D reconstruction systems, the reconstructed model is treated as a single whole, and individual objects cannot be manipulated. We therefore develop not only a scene information prediction system based on the depth estimation model but also a 3D reconstruction system that can interact with different objects in the scene. The scene prediction system predicts the depth map and outputs the semantic mask for an uploaded scene image, analyzing the scene structure in the picture. The 3D reconstruction system uses color images, depth maps, semantic masks, and other information to reconstruct the different objects in the scene through data preprocessing, point cloud reconstruction, data fusion, surface generation, scene composition, and other steps (the core back-projection step is sketched below). This system is an attempt to put depth estimation into practical use and to explore directions for subsequent engineering work in this field.
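
As an illustration of the point cloud reconstruction step, the sketch below back-projects a predicted depth map into one point cloud per semantic object using the pinhole camera model. The helper name and interface are hypothetical; the actual system adds preprocessing, fusion across views, and surface generation on top of this.

```python
import numpy as np

def backproject_by_object(depth, color, sem_mask, K):
    """Hypothetical helper: turn a depth map into per-object point clouds.

    depth:    (H, W) metric depth; color: (H, W, 3); sem_mask: (H, W) labels;
    K:        3x3 camera intrinsics. Returns {label: (points Nx3, colors Nx3)}.
    """
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = color.reshape(-1, 3)
    labels = sem_mask.reshape(-1)
    valid = points[:, 2] > 0                      # drop invalid / zero depth
    clouds = {}
    for label in np.unique(labels[valid]):
        idx = valid & (labels == label)           # one cloud per semantic object
        clouds[label] = (points[idx], colors[idx])
    return clouds
```

Grouping points by semantic label is what lets the system treat each object as a separate, editable model instead of a single fused scene.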