| 3D scene reconstruction has been studied for decades and has a wide range of applications in autonomous driving, virtual/augmented reality (VR/AR), robot navigation, medical-CAD modeling, and other fields. Traditional geometry-based methods rely on accurate image feature correspondences across views to recover camera poses and 3D structure. Pose estimation in these methods can fail when feature point correspondences are unreliable or when the optimization diverges. Data-driven, learning-based approaches have been proposed to mitigate these issues with deep models, but they often optimize tens of millions or even billions of parameters to achieve pose estimation and dense reconstruction, which is time-consuming, and their predictions may be inconsistent between frames. This thesis proposes a test-time joint optimization approach that retains the benefits of robust monocular depth estimation while requiring only dozens of parameters to be optimized. The key finding is that, when trained at a sufficient scale, a monocular affine-invariant depth estimation model can successfully transfer its seemingly weak geometry prior to challenging and diverse scenes. In this process, the unrectified scale and shift values of the affine-invariant depth and the unknown camera intrinsics and extrinsics are the main barriers to 3D scene reconstruction. Thus, we leverage 6.3 million RGBD images to train a robust monocular depth model and propose a novel scale-shift alignment module to tackle the inconsistency between frames. We then freeze the affine-invariant depth model’s outputs and quickly rectify them by optimizing dozens of parameters for each video frame. The unknown scale and shift values of the affine-invariant depth are aligned by a global alignment module and a local alignment module with learnable depth parameters, and the resulting scale-consistent per-frame depth maps can be used to robustly obtain camera poses and dense scene reconstruction, even in low-texture regions. The
contributions and innovations of this thesis are summarized as follows: (1) This thesis analyzes the advantages and disadvantages of existing 3D reconstruction methods and proposes a 3D reconstruction framework based on robust monocular affine-invariant depth estimation. A robust monocular depth model is trained on 6.3 million collected RGBD images of varying annotation quality, and its robustness and accuracy are validated on five zero-shot testing datasets. (2) This thesis analyzes the impact of inaccurate scale and shift values in monocular affine-invariant depth estimation on 3D reconstruction tasks, as well as the limitations of traditional global alignment methods. A local alignment module is proposed to address inter-frame consistency issues. Performance improvements on five zero-shot testing datasets and ablation studies demonstrate its effectiveness and robustness. (3) The proposed local alignment module not only improves depth estimation and 3D reconstruction performance but also decouples depth estimation errors into coarse misalignment errors and missing-detail errors, which can be used to analyze the bottlenecks of existing depth estimation methods. (4) By leveraging the robust monocular depth estimation module and the local scale-shift alignment module, an optimization process is designed to jointly optimize the camera intrinsic parameters, camera poses, and sparse depth points, which enables 3D reconstruction from a simple pose-free video as input. Experimental results on five zero-shot testing datasets show the effectiveness of the pipeline. |
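
The abstract does not state the alignment formulas, so the following is only a minimal sketch of what a global scale-shift alignment can look like, assuming the common closed-form least-squares fit of one scale and one shift per frame against a few reference depths. The function name `align_scale_shift` and the sample values are hypothetical and are not taken from the thesis; the local alignment module described above is a separate component and is not reproduced here.

```python
from typing import Sequence, Tuple


def align_scale_shift(pred: Sequence[float], ref: Sequence[float]) -> Tuple[float, float]:
    """Closed-form least-squares fit of ref ~= scale * pred + shift.

    `pred` holds affine-invariant depth values and `ref` holds reference
    depths at the same pixels (hypothetical inputs for illustration).
    """
    n = len(pred)
    if n != len(ref) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    # Variance of the prediction and its covariance with the reference.
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0.0:
        raise ValueError("constant prediction: scale is not identifiable")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    scale = cov / var_p
    shift = mean_r - scale * mean_p
    return scale, shift


# Synthetic example: the reference is the prediction scaled by 2 and shifted by 0.5.
affine_depth = [0.1, 0.4, 0.7, 1.0]
sparse_ref = [2.0 * d + 0.5 for d in affine_depth]
scale, shift = align_scale_shift(affine_depth, sparse_ref)
aligned = [scale * d + shift for d in affine_depth]
```

A single scale and shift per frame cannot correct errors that vary across the image, which is consistent with the limitation of global alignment that the abstract gives as motivation for the local module.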