In 3D vision, a fundamental task is to obtain the depth of objects in a scene. Acquiring depth directly with sensor devices is currently expensive and vulnerable to environmental interference, so estimating depth from RGB images with algorithms has become a research hotspot. Given a monocular image or a binocular image pair, accurate depth estimation is of great significance in application fields such as robot obstacle avoidance and autonomous driving. Monocular depth estimation needs only a single RGB camera to capture the image and then obtains pixel-level depth through an algorithm. It is a low-cost ranging method, but it is an ill-posed problem and the algorithms have low robustness. Binocular depth estimation, namely stereo matching, finds corresponding points in the image pair and matches them to obtain the disparity of each pixel, and then computes depth by combining the disparity with the camera baseline length and focal length. The depth obtained in this way is more accurate and the algorithm is more robust, but it cannot handle ill-posed regions such as weakly textured and occluded areas. Using deep learning, this paper studies the monocular and the binocular depth estimation tasks separately. Focusing on technical difficulties such as improving depth estimation accuracy and reducing model complexity, it carries out theoretical analysis, method implementation, and experimental verification. The research contents are as follows:

(1) Monocular depth estimation based on a parallel decoder

In the conventional approach of regressing the depth of a monocular image with an encoder-decoder structure, the decoder usually fuses the encoder features serially, from small scale to large scale, and finally outputs the depth map. This approach is simple and efficient, but it has difficulty recovering the spatial position information lost through the encoder's successive convolution and pooling operations. To improve on this serial approach, this paper proposes a structure, designed from the perspective of the decoder, that first predicts global and local depth information in parallel and then fuses them with a module based on an improved self-attention mechanism. The results show that the structure achieves accuracy comparable to state-of-the-art methods in indoor and outdoor scenes while using fewer parameters and less computation.

(2) Binocular depth estimation based on learnable multi-scale cost volumes

In deep learning based binocular depth estimation, one of the most important steps is constructing the matching cost volume from the left-view and right-view features. The widely used approach at present is to construct a group-wise correlation cost volume, which measures the similarity of two sets of feature vectors with a fixed, hand-crafted formula. With such a measure it is difficult to match pixels in weakly textured and occluded regions. To solve this problem, this paper proposes a learnable multi-scale matching cost calculation method, which allows the disparity in these hard-to-match regions to be estimated reasonably. In addition, because the receptive field of a convolution kernel is limited, this paper introduces multi-level dilated convolutions and multi-scale cost volumes. The experimental results show that the proposed method achieves better matching accuracy and lower model complexity than the group-wise correlation cost volume.
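For illustration only, the two standard stereo building blocks referred to above can be sketched in a few lines of plain Python: the conversion from disparity to depth for a rectified pair, Z = f · B / d, and a group-wise correlation cost computed along a single scanline. This is a minimal sketch of the baseline formulation, not the learnable cost proposed in this paper and not its implementation. All function and parameter names (`disparity_to_depth`, `groupwise_correlation`, `scanline_cost_volume`, `focal_length_px`, `baseline_m`) are hypothetical, and the sketch assumes a pinhole camera, rectified images, and a channel count divisible by the number of groups.

```python
from typing import List

Vector = List[float]


def disparity_to_depth(disparity: float, focal_length_px: float, baseline_m: float) -> float:
    """Depth Z = f * B / d for a rectified stereo pair (pinhole model assumed)."""
    if disparity <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity


def groupwise_correlation(left: Vector, right: Vector, num_groups: int) -> List[float]:
    """Split the channels into groups and return the mean dot product per group."""
    channels = len(left)
    if channels != len(right) or channels % num_groups != 0:
        raise ValueError("feature lengths must match and be divisible by num_groups")
    size = channels // num_groups
    result = []
    for g in range(num_groups):
        lo, hi = g * size, (g + 1) * size
        dot = sum(l * r for l, r in zip(left[lo:hi], right[lo:hi]))
        result.append(dot / size)
    return result


def scanline_cost_volume(
    left_row: List[Vector],
    right_row: List[Vector],
    max_disp: int,
    num_groups: int,
) -> List[List[List[float]]]:
    """volume[d][x] holds the group-wise correlation between left pixel x
    and right pixel x - d; positions without a valid match are zero-filled."""
    width = len(left_row)
    volume = []
    for d in range(max_disp):
        row = []
        for x in range(width):
            if x - d >= 0:
                row.append(groupwise_correlation(left_row[x], right_row[x - d], num_groups))
            else:
                row.append([0.0] * num_groups)
        volume.append(row)
    return volume


if __name__ == "__main__":
    # Toy features: the right row is the left row shifted by one pixel.
    left_row = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    right_row = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
    volume = scanline_cost_volume(left_row, right_row, max_disp=2, num_groups=1)
    print(volume[0][1], volume[1][1])
    print(disparity_to_depth(10.0, focal_length_px=700.0, baseline_m=0.5))
```

In the toy example the correlation at the middle pixel is higher for disparity 1 than for disparity 0, which is the signal a downstream network would use to select the disparity. The correlation itself is a fixed formula with no trainable parameters, which is the property the learnable matching cost described above is intended to replace.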