Technologies such as autonomous driving, virtual reality, and augmented reality have developed rapidly in recent years. As an indispensable component of these technologies, depth estimation has received extensive attention from researchers. Monocular depth estimation based on deep learning uses the ability of convolutional neural networks to extract abstract features in order to recover complex depth cues from two-dimensional images, thereby avoiding the high cost of traditional hardware such as lidar and millimeter-wave radar, as well as the difficulty of embedding such equipment. Unsupervised monocular depth estimation usually adopts an encoder-decoder network, and effectively fusing high- and low-level feature information while reasonably exploiting the depth features of objects at various scales is a challenging task. In addition, phenomena that are common in real scenes, such as textureless regions and occlusion, also make the task harder for the network. This thesis analyzes and studies how to effectively fuse multi-scale information and how to handle textureless regions. The main innovations and results are as follows:

(1) Unsupervised monocular depth estimation based on dense feature fusion. To address the low feature reuse rate and insufficient fusion caused by same-level skip connections in the U-shaped encoder-decoder, an unsupervised monocular depth estimation method based on dense feature fusion is proposed. First, a dense feature fusion layer is designed to fuse high- and low-level features and low-resolution disparity maps by channel stacking and convolution (a code sketch is given after contribution (2) below). Then, the dense feature fusion layers are deployed between the encoder and decoder in the form of dense connections, replacing the previous skip connections and improving the reuse rate of the features at every layer. Finally, the encoder is channel-pruned to achieve a performance balance between the encoder and decoder. On the KITTI dataset, the threshold accuracy increases to 85%, the absolute relative error decreases to 0.122, and the remaining five indicators all improve. On the Make3D dataset, the absolute relative error drops to 0.497, and the other three indicators all improve.

(2) Unsupervised monocular depth estimation based on balanced multi-scale features. To address the feature dilution caused by fusing features through the U-shaped encoder-decoder with skip connections, the loss of spatial information in the feature extraction stage of the encoder, and the scale imbalance that often exists in a scene, an unsupervised monocular depth estimation method based on balanced multi-scale features is proposed. First, dilated convolutions are added to the last two blocks of the encoder, which reduces the number of downsampling operations and preserves more spatial detail. Then, a balanced multi-scale module is designed to extract multi-scale features by pooling, perform a balanced fusion operation, and use an attention mechanism to further refine the fused features, yielding rich, low-redundancy balanced multi-scale information (see the sketch below). Finally, the high-resolution, large-receptive-field features output by the encoder are fed into the balanced multi-scale module; the two cooperate and greatly improve network performance. On the KITTI dataset, the threshold accuracy increases to 88%, the absolute relative error decreases to 0.104, and the remaining five indicators all improve. On the Make3D dataset, the absolute relative error drops to 0.330, and the other three indicators all improve.
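The dense feature fusion layer of contribution (1) can be illustrated with a minimal PyTorch sketch: feature maps from different levels and a low-resolution disparity map are resized to a common resolution, stacked along the channel axis, and fused by a convolution. The class name `DenseFeatureFusion`, the ELU activation, and the bilinear resizing are illustrative assumptions rather than details taken from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFeatureFusion(nn.Module):
    """Hypothetical dense feature fusion layer: resize all inputs to a common
    resolution, stack them along the channel axis, and fuse with a convolution."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ELU(inplace=True),
        )

    def forward(self, features, low_res_disp=None):
        # `features` is a list of encoder/decoder feature maps from different levels;
        # `low_res_disp` is an optional lower-resolution disparity prediction.
        target_size = features[0].shape[-2:]
        inputs = [F.interpolate(f, size=target_size, mode="bilinear",
                                align_corners=False) for f in features]
        if low_res_disp is not None:
            inputs.append(F.interpolate(low_res_disp, size=target_size,
                                        mode="bilinear", align_corners=False))
        # Channel stacking followed by convolution, as described in contribution (1).
        return self.fuse(torch.cat(inputs, dim=1))
```

For example, fusing a 64-channel low-level map, a 256-channel high-level map, and a 1-channel disparity map would use `DenseFeatureFusion(64 + 256 + 1, 64)`.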
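Similarly, the balanced multi-scale module of contribution (2) can be sketched as follows, assuming pooling at a few fixed output sizes, an equal-weight average as the balanced fusion, and a squeeze-and-excitation style channel attention gate; all of these specifics are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedMultiScale(nn.Module):
    """Hypothetical balanced multi-scale module: pool the input feature map to
    several scales, project each branch, average ("balance") the branches, and
    re-weight the result with a simple channel-attention gate."""

    def __init__(self, channels, pool_sizes=(1, 2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in pool_sizes
        )
        # Squeeze-and-excitation style channel attention (an assumption; the thesis
        # only states that an attention mechanism refines the fused features).
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        scales = []
        for size, proj in zip(self.pool_sizes, self.branches):
            pooled = F.adaptive_avg_pool2d(x, output_size=size)
            scales.append(F.interpolate(proj(pooled), size=(h, w),
                                        mode="bilinear", align_corners=False))
        # Balanced fusion: every scale contributes equally before attention.
        fused = torch.stack(scales, dim=0).mean(dim=0)
        return fused * self.attention(fused) + x
```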
(3) Unsupervised monocular depth estimation for large textureless regions. To address the large textureless regions caused by extensive areas of water and sky in the USVInland dataset, three improvements are proposed: a disparity initialization loss, a horizontal gradient consistency loss, and a textureless mask. First, the textureless mask algorithm extracts the textureless regions of the image. Then, the disparity initialization loss is applied to the regions identified by the textureless mask, which reshapes the loss landscape of the whole network so that training is less likely to fall into a local minimum. Finally, the horizontal gradient consistency loss is applied to the regions identified by the textureless mask to make the predicted disparity as smooth as possible in the horizontal direction. The three components cooperate so that reasonable depths can also be predicted in textureless regions (a sketch of the mask and the two losses is given below). On the USVInland dataset, the threshold accuracy increases to 64.9%, the absolute relative error decreases to 0.37, and the other five indicators all improve.
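The three components of contribution (3) can be sketched with simple PyTorch functions: a local gradient-magnitude heuristic stands in for the textureless mask algorithm, and the two losses are evaluated only inside that mask. The window size, gradient threshold, prior disparity `init_disp`, and exact loss forms are hypothetical choices for illustration, not the thesis definitions.

```python
import torch
import torch.nn.functional as F

def textureless_mask(img, window=7, threshold=0.01):
    """Hypothetical textureless-mask heuristic: mark pixels whose local intensity
    gradient stays below a threshold (e.g. open water or sky), assuming img in [0, 1]."""
    gray = img.mean(dim=1, keepdim=True)                       # B x 1 x H x W
    gx = (gray[..., :, 1:] - gray[..., :, :-1]).abs()
    gy = (gray[..., 1:, :] - gray[..., :-1, :]).abs()
    grad = F.pad(gx, (0, 1)) + F.pad(gy, (0, 0, 0, 1))
    # Average gradient over a local window; a low average marks a textureless pixel.
    local = F.avg_pool2d(grad, kernel_size=window, stride=1, padding=window // 2)
    return (local < threshold).float()

def disparity_init_loss(disp, mask, init_disp=0.01):
    """Pull masked (textureless) pixels toward a small prior disparity so the
    network is less likely to settle in a poor local minimum there."""
    return (mask * (disp - init_disp).abs()).sum() / (mask.sum() + 1e-7)

def horizontal_gradient_loss(disp, mask):
    """Penalise horizontal disparity gradients inside the textureless mask,
    encouraging smooth depth along each image row."""
    dx = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    m = mask[..., :, 1:] * mask[..., :, :-1]
    return (m * dx).sum() / (m.sum() + 1e-7)
```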