The human race perceives a three-dimensional world, and with the growth of emerging technologies such as autonomous driving, the metaverse, and digital twins, the demand for three-dimensional data continues to grow. 3D reconstruction methods based on traditional geometric vision are computationally intensive, offer poor real-time performance, and place high demands on data quality. Advances in deep learning and the availability of large-scale public datasets have made 3D reconstruction based on deep networks feasible. This thesis therefore investigates a deep learning-based multi-view 3D reconstruction algorithm, with an emphasis on enhancing feature representation to improve the accuracy and completeness of the reconstructed 3D models. The main work of the thesis is summarized as follows.

(1) A Transformer-based multi-stage multi-view stereo network, TM-MVSNet, is proposed. It adopts a coarse-to-fine structure: features are first extracted, differentiable homography warping is then used to construct a three-dimensional cost volume from the two-dimensional feature maps, and the cost volume is regularised to smooth the depth information, ultimately predicting the depth map from coarse to fine. In the feature extraction stage, a three-level feature aggregation module is designed to extract multi-scale image features and aggregate them across scales through operations such as connection and concatenation, so as to better capture complex semantic information. Moreover, to acquire global contextual information, a Transformer structure based on the self-attention mechanism is used to learn the internal dependencies among features more effectively and enhance them. Experiments verify that this algorithm effectively improves the accuracy and completeness of 3D reconstruction.

(2) Building on TM-MVSNet, the self-attention and positional encoding are improved to obtain an optimised multi-view stereo vision
network. First, linear self-attention is used in place of the native self-attention computation, which substantially reduces graphics memory usage, allows higher-resolution images to be processed, and improves network performance. Second, to make the model applicable to images of different resolutions, the native learnable positional encoding is discarded and replaced with sine-cosine positional encoding, which adds position information to each feature sequence and further improves the 3D reconstruction results. Experiments are conducted on the publicly available DTU and BlendedMVS datasets. On the Overall metric, the proposed method improves upon CVP-MVSNet, AACVP-MVSNet, TransMVSNet, and TM-MVSNet by about 14.96%, 4.85%, 17.8%, and 1.31%, respectively. In addition, the visualized results show that the point clouds reconstructed by the proposed method are more complete and generalize better.
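The differentiable homography mentioned in contribution (1) warps source-view feature maps onto fronto-parallel sweep planes of the reference view, one plane per depth hypothesis. As a minimal sketch of the underlying plane-induced homography only (the function name, variable names, and conventions here are illustrative assumptions, not taken from the thesis):

```python
import numpy as np

def plane_sweep_homography(K_src, K_ref, R, t, depth):
    """Homography mapping reference-view pixels to the source view,
    assuming a fronto-parallel sweep plane at the given depth.

    K_src, K_ref : (3, 3) camera intrinsics of source and reference views.
    R, t         : relative pose from reference to source, X_src = R X_ref + t.
    depth        : depth of the sweep plane in the reference camera frame.
    """
    n = np.array([[0.0, 0.0, 1.0]])  # fronto-parallel plane normal, (1, 3)
    # Standard plane-induced homography: H = K_src (R - t n^T / d) K_ref^{-1}
    H = K_src @ (R - (t.reshape(3, 1) @ n) / depth) @ np.linalg.inv(K_ref)
    return H
```

In an MVS network this matrix is applied per depth hypothesis to resample source features (e.g. with bilinear interpolation), and because the resampling is differentiable, gradients flow back through the warp during training.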
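The sine-cosine positional encoding adopted in contribution (2) is computed analytically rather than learned, so it extends to feature sequences of any length, which is what makes variable-resolution input possible. A minimal sketch of the standard fixed encoding (the function and argument names are illustrative assumptions):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Fixed sine/cosine positional codes of shape (seq_len, d_model).

    Even channels carry sin, odd channels cos, with wavelengths forming a
    geometric progression from 2*pi up to 10000*2*pi across the channels.
    """
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model // 2)
    angle = positions / np.power(10000.0, dims / d_model)  # per-channel frequency
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

The encoding is simply added to each feature sequence before the attention layers; unlike a learnable embedding table, nothing ties it to the image resolution seen at training time.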