Depth estimation is an important task in scene understanding, with high value and broad application in augmented reality, autonomous driving, robot navigation, and other fields. Traditional methods obtain depth data with distance measurement equipment such as LiDAR and structured-light cameras. These specialized devices are expensive, place high demands on the environment when collecting depth data, and produce depth data of low resolution. Monocular depth estimation based on deep learning can obtain depth information directly from a color image taken by a single camera, which makes it both cheaper and more widely applicable. However, an ordinary camera loses depth information when it captures a 2D image, so a single image can correspond to countless real scenes. Monocular depth estimation is therefore a very challenging problem. This paper studies the monocular depth estimation problem as follows:

(1) The bleeding effect caused by stereo occlusion degrades the results of unsupervised monocular depth estimation algorithms based on stereo images. This paper proposes a left-right cycle consistency constraint to reduce the influence of the bleeding effect. In this method, a new pair of left and right views is obtained by flipping a stereo image pair horizontally, so that the original right view, once flipped, can be treated as a left view. The left view and the flipped right view can therefore be fed to a single network, which generates the disparity maps of the left and right views with the same logic. The left and right views are then reconstructed from the corresponding disparity maps to form the constraints. This training method lets the network learn from the left and right views at the same time. The proposed network is trained on the KITTI dataset, and the experiments demonstrate that this training method improves the accuracy of depth estimation.

(2) Unsupervised monocular depth estimation algorithms based on image sequences have higher network complexity, because a subnetwork is needed to predict the camera pose used to reconstruct the preceding and following frames. This paper proposes a simplified model that calculates the displacement by multiplying the sampling interval by the instantaneous velocity of the camera. This model needs only a depth estimation network, which reduces the complexity of the model. However, experimental results show that this simplification decreases the accuracy of monocular depth estimation.

(3) To improve the prediction accuracy of monocular depth estimation, this paper compares synthetic images and semantic segmentation as sources of auxiliary information, and then designs a multi-task learning model that is trained jointly with semantic segmentation. The multi-task framework uses hard parameter sharing with a mixture-of-experts network. In place of the original multi-gate networks, the model adjusts the way the features extracted by each expert are connected. In addition, several optimization methods are proposed for the tasks in multi-task learning, and the effectiveness of each is verified by experiments.
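
The flipping step in item (1) can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names, the nearest-neighbour warping, the L1 photometric error, and the rectified-stereo sign convention (a left pixel at column x matches the right pixel at column x - d) are all assumptions made for illustration, and `predict_disparity` is a hypothetical stand-in for the single shared network.

```python
import numpy as np

def hflip(x):
    """Flip an image or disparity map horizontally (axis 1 is width)."""
    return x[:, ::-1]

def warp_horizontal(src, disparity, sign):
    """Sample src at column x + sign * disparity (nearest neighbour, clamped)."""
    h, w = disparity.shape
    cols = np.arange(w)[None, :] + sign * disparity
    cols = np.clip(np.rint(cols).astype(int), 0, w - 1)
    rows = np.arange(h)[:, None]
    return src[rows, cols]

def reconstruct_views(left, right, predict_disparity):
    """Reconstruct both views using one left-view disparity predictor.

    predict_disparity is a hypothetical callable mapping an (H, W, C)
    image that plays the role of a left view to an (H, W) disparity map.
    """
    # The left view goes through the network unchanged.
    disp_left = predict_disparity(left)
    # The flipped right view plays the role of a left view; flipping the
    # prediction back yields the disparity map of the original right view.
    disp_right = hflip(predict_disparity(hflip(right)))
    # Assumed convention: left(x) = right(x - d), right(x) = left(x + d).
    left_rec = warp_horizontal(right, disp_left, sign=-1)
    right_rec = warp_horizontal(left, disp_right, sign=+1)
    return left_rec, right_rec

def photometric_l1(reconstructed, target):
    """Mean absolute error between a reconstructed view and its target."""
    return float(np.abs(reconstructed - target).mean())
```

In a training setup the two photometric errors would be combined into the loss, and a differentiable bilinear sampler would replace the nearest-neighbour lookup used here for brevity.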
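
The simplified model in item (2) reduces to a single multiplication, sketched below. The function names, the units, and the example values are illustrative assumptions. The translation-only transform in particular is an added simplification that neglects rotation, which the text above does not confirm.

```python
import numpy as np

def displacement_from_velocity(velocity_mps, interval_s):
    """Displacement = sampling interval * instantaneous velocity.

    velocity_mps: 3-vector of camera velocity in metres per second.
    interval_s:   time between two sampled frames in seconds.
    """
    return np.asarray(velocity_mps, dtype=float) * float(interval_s)

def translation_only_transform(displacement):
    """Build a 4x4 rigid transform with identity rotation.

    Assumption for illustration only: rotation between frames is ignored.
    """
    transform = np.eye(4)
    transform[:3, 3] = displacement
    return transform

# Illustrative values: 10 m/s forward motion, frames sampled 0.1 s apart.
shift = displacement_from_velocity([0.0, 0.0, 10.0], 0.1)
pose = translation_only_transform(shift)
```

A transform of this kind would stand in for the output of the pose subnetwork when reconstructing neighbouring frames, which is why only the depth estimation network remains to be trained.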