| Semantic segmentation of images is a fundamental and challenging visual perception task in street scene understanding. It aims to assign each pixel of the input image a category label, grouping pixels of the same category into regions and outputting labels that carry semantic information. Perceiving street scene information is crucial for autonomous vehicles to make correct judgments and plans, and semantic segmentation has become a hot research topic as a critical technology for scene understanding. In recent years, the development of convolutional neural networks has brought new opportunities for semantic segmentation, which is also widely used in medical diagnosis, video surveillance, and other fields. However, some problems remain to be solved. Semantic segmentation networks lose much spatial detail while enlarging the receptive field during feature extraction. Moreover, most current semantic segmentation algorithms pursue high accuracy and cannot meet real-time requirements; the inference speed of the model must be improved for real-time scenarios while maintaining accuracy. This paper studies two aspects, segmentation accuracy and model inference speed, as follows: (1) A model based on coordinate attention and strip pooling is proposed to address the loss of detail information and the discontinuity in segmentation results caused by insufficient extraction and utilization of detailed features during feature extraction. The model uses DeepLabv3+ as the base network. In the encoder, a coordinate attention module is embedded in the backbone network ResNet to enhance its feature extraction ability. A parallel strip pooling and atrous (dilated) convolution module is designed to capture multi-scale context information with long-range dependencies. In the decoder, a semantic fusion module is used to fuse spatial location information and semantic information across layers to obtain more comprehensive feature information and fully
utilize the original image features to improve the semantic segmentation accuracy. A feature fusion module is proposed in the semantic aggregation module to assign weights to the fused features. The model is evaluated on the Cityscapes and CamVid datasets. Experiments demonstrate that the proposed method effectively overcomes the challenges of missed small objects and blurred edges in street scene semantic segmentation. It achieves a Mean Intersection over Union of 77.2% and 70.2% on the Cityscapes and CamVid datasets, respectively. (2) To meet the requirements of both accuracy and real-time performance in semantic segmentation, a real-time semantic segmentation network is proposed based on pyramid pooling, feature rectification, and feature fusion. Existing real-time semantic segmentation methods often use feature fusion to improve segmentation accuracy. However, these methods only partially utilize features of different resolutions, and the receptive field of the network is relatively limited. First, the lightweight residual network ResNet18 is used for feature extraction. Second, an improved pyramid pooling module is combined with the feature extraction network to enhance the representation of high-level semantic information. The decoder uses a feature rectification and feature fusion module to reduce redundant information before fusing features at different levels: noise is first removed from the features at each level, which are then fused layer by layer with shallow features rich in spatial information to counter the loss of detail information. Finally, a cross-entropy loss function based on Online Hard Example Mining is used to address sample imbalance. The model is validated on the public street scene dataset Cityscapes. The experiments demonstrate that the proposed method achieves a Mean Intersection over Union of 76.15% at a forward inference speed of 71 frames per second. This allows for an effective balance between
segmentation accuracy and inference speed, all while reducing the number of required model parameters. |
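Both sets of results above are reported as Mean Intersection over Union. As a reading aid, the following is a minimal sketch of how this metric is commonly computed from a pixel-level confusion matrix. It is not the authors' evaluation code: the function names `confusion_matrix` and `mean_iou`, the flat label lists, and the ignore label of 255 (a common convention for unlabeled pixels) are assumptions made for illustration.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int, ignore_label: int = 255) -> List[List[int]]:
    """Count pixels; rows are ground-truth classes, columns are predicted classes.

    Assumption: labels are flattened per-pixel class ids, predictions lie in
    [0, num_classes), and pixels whose ground truth equals `ignore_label`
    are skipped.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        if t == ignore_label:
            continue
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix: List[List[int]]) -> float:
    """Average per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                        # class c pixels predicted as something else
        fp = sum(matrix[r][c] for r in range(n)) - tp   # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:                                   # skip classes absent from both truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and one ignored pixel.
y_true = [0, 0, 1, 1, 2, 255]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(round(mean_iou(cm), 4))
```

In the toy example the per-class IoU values are 1/2, 2/3, and 1, so the mean is 13/18. In practice the confusion matrix is accumulated over the whole validation set before the per-class ratios are taken.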