A 3D point cloud models the spatial distribution and surface characteristics of a target through a discrete set of 3D points sampled on its surface. It contains complete 3D coordinates and offers a simple form, high precision, and large scale. Mining and understanding 3D point cloud data with intelligent technology can capture subtle changes in people, objects, and scenes, and thereby realize perception and understanding of the physical environment. In recent years, with the continuous advancement of 3D sensor technology and the expansion of 3D application scenarios, 3D point cloud semantic segmentation has received extensive attention and is widely applied in many fields, such as autonomous driving, robotics, smart cities, virtual reality, and remote sensing observation. In particular, the emergence of deep learning technology has brought significant progress to 3D point cloud semantic segmentation. However, two problems remain in existing deep learning-based 3D point cloud semantic segmentation: (1) a 3D point cloud is sparse, disordered, irregular, and unstructured, so it is hard to extract features effectively and to represent semantic patterns; (2) a 3D point cloud can be converted into multiple data forms, so designing deep networks for the different forms, and exploiting the advantages of multiple forms to improve segmentation performance, remains an open problem. In response to these problems, we take feature fusion in deep networks as the research clue. The main research contents are as follows:

(1) A Backward Attentive Fusing Network with a Local Aggregation Classifier is proposed to alleviate the semantic gap between encoder and decoder features during fusion in point-based point cloud semantic segmentation methods. A backward attentive fusion mechanism converts the high-level features into attention maps, which then modulate the encoder features to improve their semantic level and bridge the semantic gap
between encoder and decoder features. In addition, to enrich the context information in point-wise classifiers based on multi-layer perceptrons, Local Aggregation Graphs are embedded in the multi-layer perceptrons for context interaction, enhancing the local consistency of the segmentation results. Experimental results show that this method enhances the semantic discrimination of the final features, improves local consistency, and reduces segmentation noise.

(2) A Waterfall Feature Aggregation Network is proposed to provide more combinations of point cloud density and semantic granularity for point-based point cloud semantic segmentation. The method obtains point cloud features at different densities through cascaded sub-networks and derives rich combinations of density and semantic granularity through feature cross-fusion between the sub-networks. The cascaded sub-networks process input point clouds at different densities. Between the sub-networks, point cloud density is increased horizontally through a learnable point cloud upsampling method that also transfers semantic information, while intermediate features are fused vertically to obtain features with different combinations of point density and semantic granularity. Experimental results show that this method increases the diversity of density-granularity combinations and generally improves the segmentation of both simple objects and complex targets.

(3) A Geometry-Injected Image-Based Point Cloud Network is proposed to address the geometric distortion caused by the projection process in image-based point cloud semantic segmentation. It incorporates geometric priors of the point cloud into the network structure design, thereby improving the geometric fidelity of the segmentation process. For the truncation problem, circular convolution aggregates the features of adjacent 3D points projected onto different sides of the image. A Dual Geometric
Constraint is introduced for dislocated and empty pixels: Local Spatial Attention modulates the dislocated and empty pixels to reduce their impact on convolution, while Local Affinity Regularization provides intermediate supervision of intermediate features to facilitate feature extraction within the convolutional neighborhood. Experimental results show that this method alleviates the geometric distortion caused by the projection process and thereby improves point cloud semantic segmentation.

(4) A Point-Image-Voxel fusing Transformer is proposed to enhance semantic awareness and adaptability during multimodal data fusion in hybrid point cloud semantic segmentation. The method realizes feature interaction and fusion of the three modalities through Target-Attention and Attentive Fusion. Target-Attention measures the semantic relationship between point-image and point-voxel features and fuses them adaptively according to that relationship. Attentive Fusion generates attention coefficients from geometric information and semantic features and fuses image-point and voxel-point features. In addition, multimodal result fusion adaptively combines the prediction results of each branch to increase multimodal data interaction. Experimental results show that this method effectively fuses point cloud, image, and voxel data and achieves better point cloud semantic segmentation results.
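The backward attentive fusion of contribution (1) can be illustrated as a gating operation: a high-level feature is turned into an attention map that modulates the encoder feature before fusion with the decoder feature. This is only a one-dimensional sketch under assumed choices (a scalar linear projection `w`, `b` for the attention map and additive fusion), not the thesis's exact formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backward_attentive_fusion(enc_feat, dec_feat, w, b):
    """Sketch of backward attentive fusion: a sigmoid attention map is
    derived from the encoder feature via a hypothetical linear
    projection (w, b), the encoder feature is modulated by the map,
    and the result is fused with the decoder feature by addition."""
    att = [sigmoid(w * e + b) for e in enc_feat]         # attention map in (0, 1)
    modulated = [a * e for a, e in zip(att, enc_feat)]   # gated encoder feature
    return [m + d for m, d in zip(modulated, dec_feat)]  # skip-connection fusion
```

In a real network the projection would be a learned per-channel transformation and the fusion could be concatenation instead of addition; the sketch only shows the gate-then-fuse structure.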
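The horizontal fusion between cascaded sub-networks in contribution (2) can be sketched in one dimension: the sparser branch is upsampled to the denser branch's point count and fused with it. The duplication-based upsampling and element-wise addition below are illustrative stand-ins for the learnable upsampling and fusion modules:

```python
def upsample(feats, factor):
    """Naive upsampling by duplication -- a stand-in for the learnable
    point cloud upsampling module described in the thesis."""
    return [f for f in feats for _ in range(factor)]

def waterfall_aggregate(sparse_feats, dense_feats):
    """Sketch of horizontal cross-fusion between two cascaded
    sub-networks: the sparse branch is lifted to the dense branch's
    density, then fused element-wise (an illustrative choice)."""
    factor = len(dense_feats) // len(sparse_feats)
    up = upsample(sparse_feats, factor)
    return [u + d for u, d in zip(up, dense_feats)]
```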
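The circular convolution used for the truncation problem in contribution (3) can be illustrated along one image row: column indices wrap around, so 3D points that project onto opposite horizontal edges of the range image remain neighbours. The kernel and feature values are made up for illustration; a real implementation would use learned 2D kernels over feature maps:

```python
def circular_conv1d(row, kernel):
    """Sketch of circular convolution along the horizontal axis of a
    projected range image. `row` holds one row of scalar features and
    `kernel` has odd length; indices wrap via the modulo operator."""
    k = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            acc += w * row[(i + j - k) % n]  # modular index = wrap-around edge
        out.append(acc)
    return out
```

Note how the first output aggregates the last element of the row: without the wrap-around, features of points truncated at the image boundary would be cut off from their spatial neighbours.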
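The Target-Attention of contribution (4) can be sketched as softmax-weighted fusion: the point feature acts as the query, each modality feature (image-point, voxel-point) is scored against it, and the features are combined according to the normalized scores. The dot-product scoring below is an assumed concrete choice, not necessarily the thesis's exact similarity measure:

```python
import math

def target_attention_fuse(point_feat, modal_feats):
    """Sketch of Target-Attention fusion: score each modality feature
    by dot product with the point (query) feature, normalize the
    scores with a softmax, and return the weighted sum of features."""
    scores = [sum(p * m for p, m in zip(point_feat, f)) for f in modal_feats]
    mx = max(scores)                            # subtract max for stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]             # softmax over modalities
    dim = len(point_feat)
    return [sum(w * f[i] for w, f in zip(weights, modal_feats)) for i in range(dim)]
```

The adaptivity described in the abstract shows up in the weights: a modality whose feature agrees more with the point feature contributes more to the fused result.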