Image semantic segmentation is a fundamental yet challenging research topic in pattern recognition. Its goal is to assign a class label to every pixel of the input image. At present, most semantic segmentation networks are based on convolutional neural networks and build complex models by stacking a large number of convolutional layers. Although this approach is simple to implement and yields excellent segmentation performance, it usually incurs a heavy computational cost and high inference latency. Such networks are therefore poorly suited to hardware with limited computing and storage resources, which makes them difficult to deploy in real-world application scenarios. To address these problems, this paper studies lightweight real-time semantic segmentation networks, based on deep convolutional neural networks, that balance efficiency, accuracy, and a small memory footprint. The specific research contents are as follows:

(1) To fuse multi-scale contextual information effectively, a real-time semantic segmentation network based on context aggregation (CFSNet) is designed. The network adopts a symmetric encoder-decoder structure and comprises spatial attention modules, an asymmetric convolution module, a multi-branch asymmetric convolution module, and a channel attention module. In addition, three injection branches process the original image at different scales and inject the results into the backbone network. On a single RTX 2080 Ti GPU, CFSNet runs at 76.9 FPS on the Cityscapes dataset with a segmentation accuracy of up to 71.5% mIoU, and at 88.9 FPS on the CamVid dataset with a segmentation accuracy of up to 68.8% mIoU, using only 0.69 M parameters.

(2) To address the loss of spatial information during high-level semantic feature extraction, an efficient multi-scale context aggregation
network (EMCANet) is proposed. The network is built on a factorized dilated convolution module and a three-branch factorized dilated convolution module, both of which incorporate dense connections. These modules enlarge the receptive field of the convolution kernels without increasing the computation or the number of parameters, and they capture multi-scale contextual information. Three detail branches are added to the network to compensate for the spatial information lost during semantic feature extraction. The overall model size of EMCANet is only 0.78 M. On an RTX 2080 Ti GPU, it reaches an inference speed of 78.7 FPS on the Cityscapes dataset with a segmentation accuracy of 71.9% mIoU, and 85.4 FPS on the CamVid dataset with 69.4% mIoU.

(3) Because purely convolutional networks can model only local feature dependencies, an efficient real-time semantic segmentation network (EBSSNet) that integrates convolutional neural networks with Transformer is proposed, in which the Transformer part models global feature dependencies. The basic feature extraction unit of the network is a bilateral feature extraction unit. Three bilateral feature extraction modules are built on this unit and combined with depthwise separable convolution, depthwise convolution, a Transformer attention module, and a detail branch. EBSSNet has only 0.76 M parameters. On a single RTX 3090 Ti GPU, it processes Cityscapes images at 81.6 FPS with a segmentation accuracy of 71.6% mIoU, and on the CamVid dataset it runs at 90.5 FPS with 69.0% mIoU.

To sum up, this paper proposes three lightweight real-time semantic
segmentation networks, which combine convolutional neural networks with Transformer attention and use common lightweight image processing strategies such as spatial attention, channel attention, dilated convolution, depthwise convolution, and depthwise separable convolution. These strategies enable the networks to operate with high efficiency, low memory usage, and high performance. Extensive experiments show that the networks perform well in practice and can be used in real hardware systems.
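
As a rough illustration of why the lightweight convolution variants mentioned in this abstract keep the parameter count low, the following sketch compares the weight count of a standard 3x3 convolution with a depthwise separable one and with an asymmetric (3x1 followed by 1x3) one, and shows how dilation enlarges the span of a kernel without adding weights. The channel width of 64, the dilation rates, and all function names are assumptions chosen for illustration; they are not taken from CFSNet, EMCANet, or EBSSNet.

```python
def conv_params(c_in, c_out, k_h, k_w, groups=1):
    """Number of weights in a 2-D convolution, ignoring bias terms."""
    return (c_in // groups) * c_out * k_h * k_w


def standard_3x3(channels):
    """Ordinary 3x3 convolution with equal input and output width."""
    return conv_params(channels, channels, 3, 3)


def depthwise_separable_3x3(channels):
    """Depthwise 3x3 (one filter per channel) followed by a pointwise 1x1."""
    depthwise = conv_params(channels, channels, 3, 3, groups=channels)
    pointwise = conv_params(channels, channels, 1, 1)
    return depthwise + pointwise


def asymmetric_3x3(channels):
    """A 3x1 convolution followed by a 1x3 convolution."""
    return conv_params(channels, channels, 3, 1) + conv_params(channels, channels, 1, 3)


def effective_kernel_size(kernel, dilation):
    """Span covered by a dilated kernel; the weight count does not change."""
    return kernel + (kernel - 1) * (dilation - 1)


if __name__ == "__main__":
    c = 64  # hypothetical channel width
    print("standard 3x3:       ", standard_3x3(c))             # 36864
    print("depthwise separable:", depthwise_separable_3x3(c))  # 4672
    print("asymmetric 3x1+1x3: ", asymmetric_3x3(c))           # 24576
    for d in (1, 2, 4):  # hypothetical dilation rates
        print(f"3x3 kernel, dilation {d}: spans {effective_kernel_size(3, d)} pixels")
```

Under these assumptions, the depthwise separable form needs roughly one eighth of the weights of the standard convolution, and a dilation rate of 4 stretches a 3x3 kernel over a 9x9 window at the same cost.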