Image segmentation is a fundamental problem in image scene understanding in the field of computer vision. With the continuous growth of large-scale data and the increasing complexity of image scenes, semantic segmentation or instance segmentation alone cannot provide the fine-grained scene understanding required by most computer vision tasks. Panoptic segmentation makes better use of semantic labels and instance features to achieve a comprehensive understanding of image scenes. However, current panoptic segmentation methods with high segmentation accuracy require a large amount of computation, while methods with high inference efficiency cannot reach the accuracy required for practical applications. This conflict between segmentation accuracy and inference efficiency has become a bottleneck in deploying 2D image and 3D LiDAR (Light Detection and Ranging) panoptic segmentation in practical scenarios such as autonomous driving. This thesis therefore addresses the above issues through the following research:

1. A comparative study of panoptic segmentation methods under mainstream architectures. This thesis classifies current deep-learning-based 2D image panoptic segmentation methods by architecture: top-down two-stage, top-down single-stage, bottom-up, and single-path. Targeting the conflict between inference efficiency and segmentation accuracy, performance comparison experiments are conducted on two standard public 2D image datasets, MS COCO and Cityscapes, covering both network inference efficiency and segmentation accuracy. The results show that models with higher panoptic segmentation quality have lower inference efficiency, while methods with high inference efficiency yield unsatisfactory panoptic segmentation quality. Visual analysis of the experimental results and segmentation predictions shows that single-stage 2D image panoptic segmentation methods offer satisfactory inference efficiency, so the accuracy-efficiency conflict in the 2D image panoptic segmentation task can be alleviated by improving their segmentation accuracy.

2. A panoptic segmentation method based on pixel-level instance perception. To improve the accuracy of single-stage 2D image panoptic segmentation, Chapter 4 proposes a panoptic segmentation method based on pixel-level instance perception. The method uses a feature pyramid network to extract multi-scale features from the input image and feeds the multi-scale feature maps to a semantic branch, an object detection branch, and a panoptic branch. The FCOS object detector provides instance center information to the panoptic branch to assist the generation of pixel-level instance-aware masks, while the semantic branch supplies semantic information to the panoptic branch. The product of the semantic branch and panoptic branch outputs is the final segmentation result, avoiding a post-processing fusion step. On the Cityscapes validation set, the method achieves a panoptic quality (PQ) of 59.4% and an mIoU of 76.2% with a running time of 78 ms; on the MS COCO test set, it achieves a PQ of 41.2% with a running time of 32 ms. Its panoptic segmentation performance approaches that of the best two-stage methods, while its running time is 140 ms and 88 ms lower than those of the best-performing methods, Unifying on MS COCO and EfficientPS on Cityscapes, respectively. The comparison of experimental results and visual analysis of segmentation predictions show that the proposed method resolves the over-segmentation and under-segmentation of objects of different sizes and achieves a good trade-off between accuracy and inference time. Moreover, the inference process is simple and can be optimized with a deep learning inference engine, which reduces the difficulty of deployment in practical application scenarios.

3. A LiDAR panoptic segmentation method based on multi-scale cascaded attention. The experiments in Chapter 4 demonstrate that a single-stage 2D image panoptic segmentation method can effectively alleviate the conflict between inference efficiency and segmentation accuracy by improving accuracy while maintaining efficiency. Extending this strategy to the 3D LiDAR panoptic segmentation task, this thesis proposes a LiDAR panoptic segmentation network based on multi-scale cascaded attention, following the single-stage design. Submanifold sparse convolution performs sparse feature extraction and point-wise feature encoding on feature maps of different voxel sizes. A multi-scale cascaded attention network then extracts voxel-level attention information from the sparse voxel features, and the correlation between voxels, together with global point-wise information, supervises the entire network to improve feature learning. The network outputs the top-view sparse instance center distribution, multi-scale sparsity, instance center point offsets, and semantic supervision information; fusing this supervision information yields the final LiDAR panoptic segmentation result. On the SemanticKITTI test set, the network achieves a PQ of 60.9% and an mIoU of 71.2%, with an average inference time of 68.5 ms per 3D LiDAR scan; on the nuScenes test set, it achieves a PQ of 63.5% and an mIoU of 75.4%. The method proposed in Chapter 5 extracts sparse voxel features from LiDAR data through a parallel encoding network with submanifold sparse convolution, removing a large amount of redundant information and improving computational efficiency. Visual analysis of the panoptic segmentation predictions shows that the attention alignment between multi-scale sparse features and global sparse voxel features improves the accuracy of instance center regression for large objects; by capturing the correlation between voxels, the over-segmentation of large objects in LiDAR images is resolved.
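The post-processing-free fusion described in point 2 — taking the product of the semantic branch and panoptic branch outputs — can be sketched in NumPy. This is an illustrative sketch under assumed tensor shapes, not the thesis implementation; the names `semantic_probs` and `instance_masks` and the 0.5 confidence threshold are assumptions for illustration.

```python
import numpy as np

def fuse_branches(semantic_probs, instance_masks):
    """Fuse semantic and pixel-level instance-aware predictions by product.

    semantic_probs : (C, H, W) per-class probabilities from the semantic branch
    instance_masks : (N, H, W) soft instance-aware masks from the panoptic branch
    Returns an (H, W) instance-id map (0 = no instance) and an (H, W) class map.
    """
    class_ids = semantic_probs.argmax(axis=0)      # per-pixel semantic label
    confidence = semantic_probs.max(axis=0)        # (H, W) semantic confidence
    scores = instance_masks * confidence[None]     # product of the two branches
    instance_ids = scores.argmax(axis=0) + 1       # instance ids 1..N
    instance_ids[scores.max(axis=0) < 0.5] = 0     # weak pixels -> no instance
    return instance_ids, class_ids
```

Because the product directly scores each pixel against every instance mask, the per-pixel argmax replaces the heuristic fusion step that two-stage methods perform after inference.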
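The sparse voxel features in point 3 presuppose a voxelization step: submanifold sparse convolution operates only on occupied voxels, so the point cloud is first quantized into a sparse set of integer voxel coordinates with one feature per occupied voxel. A minimal NumPy sketch of that step (the function name and the mean-pooling choice are assumptions, not details from the thesis):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Quantize an (M, 3) point cloud into sparse voxels.

    Returns the unique integer voxel coordinates and a mean-pooled
    feature (here simply the mean xyz) per occupied voxel -- the
    sparse input a submanifold sparse convolution would consume.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)     # (M, 3)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    feats = np.zeros((len(uniq), points.shape[1]))
    np.add.at(feats, inverse, points)                           # scatter-sum
    feats /= np.bincount(inverse)[:, None]                      # mean pooling
    return uniq, feats
```

Only occupied voxels are stored, which is what removes the redundant empty space from a LiDAR sweep and underlies the efficiency gain attributed to the sparse encoding network.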
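The panoptic quality values reported above follow the standard PQ definition: predicted and ground-truth segments matching with IoU > 0.5 are true positives, and PQ = (sum of matched IoUs) / (|TP| + ½|FP| + ½|FN|). A minimal sketch of the metric:

```python
def panoptic_quality(matched_ious, num_pred, num_gt):
    """Standard PQ over one class: matches (IoU > 0.5) are true positives.

    matched_ious : IoU values of matched segment pairs (each > 0.5)
    num_pred     : total number of predicted segments
    num_gt       : total number of ground-truth segments
    """
    tp = len(matched_ious)
    fp = num_pred - tp                  # unmatched predictions
    fn = num_gt - tp                    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom > 0 else 0.0
```

For example, three matches with IoUs 0.9, 0.8, 0.7 out of four predictions and four ground-truth segments give PQ = 2.4 / (3 + 0.5 + 0.5) = 0.6.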