Monocular depth estimation is highly challenging due to the scale ambiguity issue: different camera positions produce different depth maps for the same scene. We observe that such ambiguity significantly hinders the learning of CNN models, so huge amounts of data and sophisticated post-processing are often required. Inspired by human depth perception, we propose to decouple depth estimation into two components, i.e., scene structure and depth scale, which significantly accelerates network convergence and improves estimation accuracy. Moreover, depth detail is further improved by a gradient-based refinement module. The estimated high-quality depth maps can benefit many applications, such as 3D reconstruction, which typically requires fine geometric details and accurate object boundaries.

Our monocular depth estimation network adopts a divide-and-conquer strategy and is named DCNet for short. Even without a sophisticated network structure or loss function design, our method clearly outperforms the state of the art. Meanwhile, the scale decoupling module and gradient refinement module in DCNet are orthogonal to other advances in elaborate CNN structures and loss design; we believe more carefully designed network structures and loss functions can further benefit our model.

To summarize, we make the following contributions:
a. We argue that monocular depth estimation can be naturally decoupled into scene structure and depth scale. Based on this, we propose a scale decoupling module to separately learn these factors, achieving a high convergence rate and promising depth estimation quality.
b. We develop a gradient refinement module to refine geometric details. Our DCNet with the scale decoupling module and gradient refinement module outperforms the state of the art.
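The decoupling idea can be made concrete with a minimal sketch: a depth map is factored into a global scale (here, the mean depth, which is one plausible choice; the paper's exact normalization is not specified in this section) and a scale-invariant scene-structure map, and the two can be recomposed exactly. The function names below are illustrative, not from the paper.

```python
def decouple_depth(depth_map):
    """Factor a depth map into (scale, structure).

    depth_map: flat list of per-pixel depths in metric units.
    scale: a single global value (here the mean depth; an assumed choice).
    structure: relative depths, invariant to the camera's overall distance.
    """
    scale = sum(depth_map) / len(depth_map)        # global depth scale
    structure = [d / scale for d in depth_map]     # scale-invariant structure
    return scale, structure

def recompose(scale, structure):
    """Recover the metric depth map from its two factors."""
    return [scale * s for s in structure]
```

With this factorization, a network can learn scene structure independently of camera placement: moving the camera rescales every depth by roughly the same factor, which changes `scale` but leaves `structure` nearly unchanged, removing the ambiguity from the structure-learning target.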