| 3D reconstruction can recover the spatial structure of scene objects.It is widely used in fields like automatic driving,medical image processing and virtual reality.As an important approach for 3D reconstruction,multi view stereo(MVS)recovers the 3D spatial information from 2D images based on the principles of photography.Therefore,MVS is cost efficient,and can easily generalize to various scenes.Traditional MVS methods rely on handcrafted features.They achieve satisfactory performances under ideal Lambertian scenarios,but often fail in regions with low texture and reflections.In recent years,deep learning has been widely used in computer vision.The deep learning methods automatically learn global features of the inputs through convolutional neural networks(CNN).Compared with traditional handcrafted features,the deep features contain more semantic information.Inspired by this,we proposed a MVS approach based on deep learning.Our work builds deep neural networks for robust matching,which improves the performance of MVS.We first pre-process the images to estimate camera parameters,determine depth ranges and select images for matching.Then,we build deep neural networks to estimate the depth map for each view.To achieve this goal,we first extract features of the images with a 2D CNN.Next,we build 3D cost volume via homography warping and regularize the cost volume with a 3D CNN for initial depth estimation.Finally,we filter the outliers based on estimation confidence and refine the initial depth maps with the residual network.For post-processing,we filter the depth maps by depth consistency,and fuse the final depth maps to form dense point clouds.We train the model on the DTU dataset with TensorFlow and Python.The model is tested on both the DTU testdata and our self-collected data.Its performance is compared with those of VisualSFM,Bundler,OpenMVG,MVE and Colmap.The results show that the proposed method achieves high precision and completeness.More importantly,it outperforms the traditional methods in handling low texture and reflective conditions.Besides,we look into the influence of images used for matching.We find that more images result in better performance.What's more,we measure the running time of our system.It is shown that our method is fast,with the average time cost for single view depth estimation to be around 4.8 seconds. |