Human beings live in a three-dimensional (3D) world and perceive it in 3D. In contrast, existing computer systems mainly observe and analyze two-dimensional (2D) images. It is therefore important to explore 3D object reconstruction so that computer systems can see the world as humans do. With the success of deep learning architectures and the release of large-scale 3D shape datasets, deep learning-based 3D object reconstruction methods have made great progress. However, an analysis of the current state of research shows that existing methods still suffer from several problems. To address these problems, this paper proposes a series of deep learning-based approaches for reconstructing sparse and dense point clouds from binocular images. The problems in existing methods and the original contributions of this paper can be summarized as follows:

First, single-view 3D object reconstruction methods are limited by the visual cues available in one view, and existing multi-view methods have difficulty effectively capturing information about occluded regions in the input views when reconstructing a point cloud from binocular images. This paper therefore proposes a staged binocular sparse point cloud reconstruction model, termed DV-Net. It takes a pair of RGB images from different views as input and outputs a complete point cloud of the given object in two stages. In the first stage, DV-Net generates a relatively robust point cloud for each input view, thereby largely recovering the shape of regions occluded in that view. In the second stage, the model extracts features from the relatively robust point clouds of the two views; these features contain information about the occluded regions. DV-Net then aggregates the extracted features to overcome the limited visual cues of single-view methods and generates a complete sparse point
cloud.

Second, multi-view 3D object reconstruction requires learning the correspondence between object regions across different views and modeling the dependencies among regions within an object. This paper extends DV-Net with the idea of staged reconstruction and presents a binocular sparse point cloud reconstruction architecture based on correspondence and dependency, called DVPC. It takes images from two views as input and progressively generates a refined point cloud. First, a point cloud generation network produces a coarse point cloud for each input view. Second, a dual-view point cloud synthesis network is devised; it learns high-quality correspondences between regions across the two coarse point clouds, so that DVPC can fuse features from different views more accurately. The synthesis network then produces a relatively precise point cloud by establishing communication between the coarse point clouds and the fused features. Last, a point-region transformer network is devised to model the dependencies among regions within the relatively precise point cloud. Using these dependencies, the relatively precise point cloud is refined into a sparse point cloud with finer structure.

Third, multi-view 3D object reconstruction also requires extracting fine-grained semantics from the input views and modeling the semantic association between them; in addition, reconstructing a dense point cloud from binocular images remains a challenge. This paper integrates DV-Net and the idea of staged reconstruction into the transformer model and presents a dense point cloud reconstruction approach based on a semantics-aware transformer, named SATF. It reconstructs a dense point cloud of a given object from dual-view RGB images. SATF is composed of two parallel view transformer encoders and a point cloud transformer decoder. Each view transformer encoder learns a multi-level
feature, which facilitates characterizing the fine-grained semantics of its input view. The point cloud transformer decoder obtains a semantically associated feature by modeling the semantic correlation between the multi-level features of the two input views, which describes the semantic association between the views. It then generates a sparse point cloud using this semantically associated feature. Finally, the decoder enriches the sparse point cloud to produce a high-resolution dense point cloud.
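The second stage of DV-Net aggregates per-view features into one view-order-invariant representation. The sketch below is not the DV-Net implementation; it is a minimal NumPy illustration, with a hypothetical toy feature extractor (a fixed random projection standing in for a learned encoder), of how per-view point cloud features can be max-pooled and fused symmetrically so that the result does not depend on the order of the two views.

```python
import numpy as np

def extract_view_feature(points: np.ndarray) -> np.ndarray:
    """Toy per-point feature: a fixed random projection of xyz (stands in
    for a learned point cloud encoder shared across the two views)."""
    rng = np.random.default_rng(0)          # fixed seed = shared "weights"
    w = rng.standard_normal((3, 64))
    return np.maximum(points @ w, 0.0)      # (N, 64), ReLU

def fuse_views(points_a: np.ndarray, points_b: np.ndarray) -> np.ndarray:
    """Symmetric max-pool fusion of the two per-view global features."""
    feat_a = extract_view_feature(points_a).max(axis=0)   # (64,) global feature, view A
    feat_b = extract_view_feature(points_b).max(axis=0)   # (64,) global feature, view B
    return np.maximum(feat_a, feat_b)                     # element-wise max: order-invariant

# Two per-view point clouds (N x 3), e.g. the stage-one outputs.
pc_a = np.random.default_rng(1).standard_normal((256, 3))
pc_b = np.random.default_rng(2).standard_normal((256, 3))
fused = fuse_views(pc_a, pc_b)                            # (64,) fused representation
```

Element-wise max is one simple symmetric aggregator; a learned fusion network could replace it without changing the order-invariance argument.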
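The dual-view synthesis network in DVPC learns correspondences between regions of the two coarse point clouds. A common way to realize such soft correspondence is cross-attention; the NumPy sketch below (an illustration under that assumption, not DVPC's actual network) shows region features of one view attending to the other, with the attention weights acting as a soft region-to-region correspondence.

```python
import numpy as np

def cross_attention(q_feat: np.ndarray, kv_feat: np.ndarray) -> np.ndarray:
    """Scaled dot-product cross-attention: each region of one view gathers
    features from the other view, weighted by a soft correspondence."""
    d = q_feat.shape[-1]
    scores = q_feat @ kv_feat.T / np.sqrt(d)        # (Nq, Nk) region affinities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # rows sum to 1: soft correspondence
    return attn @ kv_feat                           # correspondence-weighted fusion

rng = np.random.default_rng(0)
regions_a = rng.standard_normal((32, 16))           # region features, coarse cloud A
regions_b = rng.standard_normal((48, 16))           # region features, coarse cloud B
fused_a = cross_attention(regions_a, regions_b)     # (32, 16) features for A's regions
```

Running the same function with the arguments swapped gives the symmetric fusion for view B's regions.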
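The last step of SATF enriches a sparse point cloud into a dense one. A folding-style upsampler is one standard way to do this; the sketch below is a NumPy toy under that assumption (a fixed random projection replaces the learned offset network), replicating each sparse point four times and displacing each copy by a small offset derived from a 2-D grid code.

```python
import numpy as np

def upsample(sparse_pc: np.ndarray, scale: float = 0.02) -> np.ndarray:
    """Folding-style upsampling sketch with a fixed 2x2 grid (ratio 4):
    each sparse point becomes four nearby dense points."""
    n = sparse_pc.shape[0]
    grid = np.stack(np.meshgrid([-1.0, 1.0], [-1.0, 1.0]),
                    axis=-1).reshape(-1, 2)              # (4, 2) grid codes
    rng = np.random.default_rng(0)
    w = rng.standard_normal((2, 3)) * scale              # stand-in for an offset MLP
    offsets = grid @ w                                   # (4, 3) small displacements
    dense = sparse_pc[:, None, :] + offsets[None, :, :]  # (n, 4, 3)
    return dense.reshape(n * 4, 3)

sparse = np.random.default_rng(1).standard_normal((1024, 3))
dense = upsample(sparse)                                 # (4096, 3) dense cloud
```

Because the grid codes are symmetric about the origin, the four copies of each point average back to the original point, so the upsampled cloud stays centered on the sparse one.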