| Semantic segmentation is a category of scene understanding and a fundamental but challenging task in computer vision. In recent years, with the development of deep learning, deep convolutional neural networks have shown outstanding performance in semantic segmentation. However, some state-of-the-art semantic segmentation methods based on deep convolutional neural networks suffer from high computational complexity and long inference times because they use complex network architectures, which greatly limits their application in real-world scenes that require fast processing. Therefore, research on deep-learning-based real-time, high-performance semantic segmentation is both practically significant and challenging. The main contributions of this thesis are summarized as follows: (1) We propose a two-path real-time high-performance semantic segmentation method that achieves a good trade-off between segmentation accuracy and inference speed. Specifically, a lightweight baseline network with atrous convolution and attention is first used as the basic feature extraction network to efficiently obtain dense feature maps. Then, a novel distinctive atrous spatial pyramid pooling module, which uses pooling operations of different sizes to encode rich and distinctive contextual semantic information, is developed to capture multi-scale objects. Meanwhile, a spatial detail-preserving network with several shallow convolutional layers is designed to generate high-resolution feature maps that preserve detailed spatial information. Finally, a simple but practical feature fusion network is used to effectively combine the deep features from the semantic path with the shallow features from the spatial path. Using only a single NVIDIA TITAN X card, the proposed method achieves test accuracies of 73.6% and 68.0% mean intersection over union at inference speeds of 51.0 and 39.3 frames per second, respectively, on the Cityscapes and CamVid
datasets. (2) We propose a mixed multi-path real-time high-performance semantic segmentation method that greatly improves accuracy while maintaining real-time performance. Specifically, we first select the lightweight residual network ResNet-18 as the basic feature extraction network to efficiently obtain feature maps of different sizes corresponding to different downsampling stages. Then, the feature maps from the different stages of the basic feature extraction network are fed to separate branch paths for processing, where each branch corresponds to a specific scale. This design improves feature extraction ability while addressing the multi-scale problem. Moreover, the top branch of the multi-path network uses a relatively fine low-level feature map, so the structure also helps preserve detailed spatial information. In addition, to further improve network performance, we apply different residual modules on each path and use a global pooling layer to obtain global context from the smallest output feature maps. Finally, we use a feature transformation module to transform and fuse the multiple features and obtain the final prediction. Using only a single NVIDIA TITAN X card, the proposed method achieves a test accuracy of 74.8% mean intersection over union at an inference speed of 51.4 frames per second on the Cityscapes dataset. |
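
The accuracy figures above are reported as mean intersection over union (mIoU). As a reading aid, the metric can be sketched in plain Python as follows. This is a minimal illustration and not the evaluation code used in the thesis: the function name `mean_iou`, the flattened label lists, and the ignore label value of 255 are assumptions made here, and benchmark evaluations typically accumulate the counts over the whole test set rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flat sequences of integer class labels of equal
    length. Pixels whose ground-truth label equals ignore_label are skipped
    (255 is an assumed convention here).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Average only over classes that actually appear.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and one ignored pixel.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(pred, target, num_classes=3)
```

In this toy example the per-class IoU values are 1/2, 2/3, and 1, so the mean is 13/18, or roughly 72.2%.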