Multi-view stereo (MVS) reconstruction is an important, long-standing research topic in computer vision. The technology plays a key role in virtual reality (VR), digital twins, and the digital conservation of cultural relics. As demand for it grows across industries, the requirements on reconstruction quality have become more stringent. Recently, the rise of deep learning has accelerated the development and application of computer vision, and learning-based multi-view stereo reconstruction has likewise become a research hotspot. Learning-based methods significantly improve scene reconstruction performance. However, existing methods still fall short of ideal results in regions of the scene that lack reasonable geometric assumptions, such as weakly textured regions with thin structures and repetitive textures. Achieving effective multi-view stereo reconstruction in such difficult regions is therefore an urgent open problem. How to exploit the geometric topology implied in the scene during reconstruction, and how to improve the consistency of feature matching across views, also deserve extensive investigation. Starting from these problems, the thesis builds on deep learning and the basic theory of multi-view stereo to propose, step by step, several multi-view stereo reconstruction methods. The main work and contributions are as follows:

(1) To address the difficulty of reconstructing weakly textured regions with multi-view stereo methods, the thesis proposes a multi-view stereo network with point attention (PA-MVSNet). First, the network roughly estimates a low-resolution depth map of the
reference view and projects the coarse depth map into a 3D point cloud based on the camera pose of the reference view. Then, during feature learning and aggregation on the point cloud, the important intra-point and inter-point features are perceived dynamically through point attention mechanisms in two spatial dimensions. Finally, high-resolution depth maps are predicted through iterative depth map up-sampling and point-attention feature learning. In addition, to enrich the image features, the thesis combines different pooling strategies in the multi-view feature extractor to capture the contextual features of different regions, and uses the high-frequency information of the image as feature residuals to supplement the details lost when down-sampling the image. Experiments show that PA-MVSNet significantly improves the reconstruction of weakly textured regions, achieving the best reconstruction accuracy and overall metric at the time on the DTU dataset, as well as better generalization performance on the Tanks & Temples dataset.

(2) To address the problem that multi-view stereo reconstruction methods cannot fully perceive the geometric structure of the scene, the thesis proposes a network that explores point feature relations on point clouds for multi-view stereo (PFR-MVSNet). The network consists of three core modules: the dynamic structure perception module, the adaptive structure feature learning module, and the self-attention-based feature learning module. First, the dynamic structure perception module augments the point cloud features with large-scale 3D point cloud projection features. Second, spatial structure features are established within local regions of the point cloud, and the feature learning of the points is guided by information aggregated according to structural similarity. Third, the adaptive structure feature learning module repartitions local regions of structural similarity based
on the perceived structure features. Finally, the self-attention-based feature learning module further learns the point features in the new local regions. Experiments show that PFR-MVSNet effectively perceives the topological information implied in the point cloud, making the features learned by the network more consistent with the geometry of the scene as perceived by humans. The network achieves the best reconstruction accuracy at the time on the DTU dataset and generalizes better on the Tanks & Temples and ETH3D datasets than the state-of-the-art methods of the time.

(3) To address the inconsistency of multi-view feature matching in multi-view stereo reconstruction methods, the thesis analyzes the multi-view stereo reconstruction process to identify the fundamental causes of the problem, and proposes a network that improves feature consistency across views for multi-view stereo (FC-MVSNet). On the one hand, a color invariant model is introduced, which derives a set of color properties independent of changes in imaging conditions, thereby improving feature consistency across views and alleviating the data sensitivity of supervised methods. On the other hand, several pixel-level feature losses are proposed to further encourage the model to maintain and enhance consistent features across views during image feature extraction. Experiments show that, for pixels with the same spatial location and meaning in multiple views, FC-MVSNet still perceives similar visual representations when the imaging conditions change. The network achieves the best reconstruction completeness on the DTU and Tanks & Temples datasets compared with other methods, and also shows state-of-the-art generalization performance on the scenes of the ETH3D dataset.
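
As an illustration of the first step of PA-MVSNet described in (1), where a coarse depth map is projected into a 3D point cloud using the camera pose of the reference view, the following is a minimal sketch of depth back-projection under a standard pinhole camera model. It is not the thesis's implementation: the function name `depth_to_point_cloud`, the intrinsic matrix `K`, and the camera-to-world convention for `R` and `t` are assumptions made only for this example.

```python
import numpy as np


def depth_to_point_cloud(depth, K, R, t):
    """Back-project a depth map into world-space 3D points.

    Assumed conventions (illustrative, not taken from the thesis):
      depth : (H, W) array of depths along the camera z-axis
      K     : (3, 3) pinhole intrinsic matrix
      R, t  : camera-to-world rotation (3, 3) and translation (3,)
    Returns an (H*W, 3) array of points in row-major pixel order.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    pixels = pixels.reshape(-1, 3).astype(np.float64)
    # Rays in the camera frame with unit z, then scaled by per-pixel depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth.reshape(-1, 1)
    # Rigid transform from the camera frame to the world frame.
    return points_cam @ R.T + t


# Toy usage: a 2x4 depth map at constant depth 5, identity camera pose.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 4), 5.0)
points = depth_to_point_cloud(depth, K, np.eye(3), np.zeros(3))
```

With an identity pose, the pixel at the principal point maps to a point on the optical axis at the given depth, which is a quick sanity check for the intrinsic and pose conventions before applying the function to real data.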