Environmental perception plays a pivotal role in visual information processing for autonomous driving. As an important part of the environmental perception task, road scene semantic segmentation recognizes and analyzes acquired visual data, extracts the structural information of each class in the image, and partitions the image according to semantic class. Accurate pixel-by-pixel segmentation of complex road scenes has proved challenging for traditional image segmentation methods. However, with the rapid advancement of deep learning technology and hardware, convolutional neural networks have become the dominant approach for computer vision tasks such as image semantic segmentation. Convolutional neural networks excel at extracting high-dimensional features such as spatial positions, edge contours, and target attributes in images, and have therefore attracted extensive attention from researchers. Convolutional neural network-based road scene segmentation methods fall broadly into two categories: encoder-decoder methods and parallel-structure methods. Encoder-decoder methods have a simple structure but lose detailed information; parallel-structure methods obtain multi-scale feature maps but ignore shallow features to some extent. Given the high accuracy demands of environment perception tasks, this paper focuses on the loss of detailed information and the blurred segmentation of image edges, using a multi-scale attention fusion network and a local aggregation Token network to accurately segment important features and object edges in road scenes. The main research objectives of this paper are as follows:

(1) To address the loss of detailed features caused by repeated convolution and pooling operations in the feature extraction networks of current mainstream road scene segmentation models, a semantic segmentation model based on a multi-scale attention fusion method is proposed. A dilated convolution iterative aggregation module is introduced to alleviate the loss of detailed features in the feature extraction stage. Moreover, a multi-scale attention fusion module is proposed to capture the dependencies among global attention, spatial attention, and channel attention, thereby enhancing the model's contextual awareness.

(2) To address the weak context-extraction ability of existing road scene semantic segmentation models and their insufficient modeling of local detail in dense prediction, a novel end-to-end semantic segmentation model based on a local aggregation Token decoding method is proposed. First, the receptive field of the model is expanded through a densely connected dilated convolution concatenation module, improving its context awareness. Second, a local aggregation Token decoding module that applies local attention to detailed features is proposed, which avoids network degradation through multiple residual connections. The proposed method models detailed features in the mapping space, which effectively alleviates the blurred boundaries between different classes in road scene segmentation.

To validate the effectiveness of the two models, this paper conducts ablation and comparative experiments on the public datasets CamVid, Cityscapes, and MFNet. The models effectively improve the accuracy and real-time performance of road scene image segmentation, achieving a balance between segmentation accuracy and prediction time.
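The attention fusion described in objective (1) can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the learned convolutional projections are omitted, and the branch combination (element-wise sum with a residual connection) is an assumption for illustration; only the general channel-gating and spatial-gating pattern is shown.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Channel branch: global average pool each channel, then gate the
    channels with a sigmoid of the pooled response. feat: (C, H, W)."""
    pooled = feat.mean(axis=(1, 2))          # (C,)
    gate = _sigmoid(pooled)                  # per-channel weight in (0, 1)
    return feat * gate[:, None, None]

def spatial_attention(feat):
    """Spatial branch: average over channels to get a (H, W) map, then
    gate each pixel location with a sigmoid of that map."""
    pooled = feat.mean(axis=0)               # (H, W)
    gate = _sigmoid(pooled)                  # per-pixel weight in (0, 1)
    return feat * gate[None, :, :]

def attention_fusion(feat):
    """Fuse the two attention branches by element-wise sum, keeping a
    residual connection to the input (an assumed fusion rule)."""
    return feat + channel_attention(feat) + spatial_attention(feat)
```

In a real model each branch would use learned 1x1 convolutions and the gates would be trained end to end; the sketch only shows how channel-wise and pixel-wise reweighting can be combined on one feature map.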
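The local aggregation Token decoding idea in objective (2) can likewise be sketched under simplifying assumptions: a 1-D token sequence, a single attention head, no learned query/key/value projections, and a fixed window size. Each token attends only to its neighbors within the window, and a residual connection adds the aggregated result back to the input.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_token_aggregation(tokens, window=3):
    """Locally aggregate a sequence of patch tokens.

    tokens: (N, D) array of token embeddings.
    Each token i attends (scaled dot-product) only to tokens in a
    local window around i, then a residual connection is applied,
    which helps avoid network degradation in deep stacks."""
    n, d = tokens.shape
    half = window // 2
    out = np.empty_like(tokens)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        neigh = tokens[lo:hi]                    # (w, D) local neighborhood
        scores = neigh @ tokens[i] / np.sqrt(d)  # (w,) similarity scores
        weights = _softmax(scores)               # local attention weights
        out[i] = weights @ neigh                 # aggregated neighbor info
    return tokens + out                          # residual connection
```

Restricting attention to a local window keeps the cost linear in sequence length and biases the decoder toward the fine-grained, nearby detail that pixel-accurate boundaries depend on, which is the motivation the abstract gives for local rather than global attention.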